Building a GPT Model from Scratch for Text Generation

Preface

In this post we build a GPT model from scratch using the Transformer decoder and generate text with several different sampling algorithms.

Data Processing

  1. Download the file simplebooks.zip from the given URL and extract it locally; this dataset contains some simple text data.
  2. Use tf.data.TextLineDataset to create the training and validation datasets. For the training set, a filter drops lines shorter than MIN_STRING_LEN before batching and shuffling; the validation set is only filtered and batched.
  3. Use keras_nlp.tokenizers.compute_word_piece_vocabulary to build the vocabulary. The function takes the training data along with the vocabulary size, whether to lowercase, and the reserved special tokens.
  4. Use keras_nlp.tokenizers.WordPieceTokenizer to set up the tokenizer, passing in the vocabulary built above and the sequence length.
  5. Use keras_nlp.layers.StartEndPacker to set up the StartEndPacker, specifying the sequence length and the ID of the start token.
import os
import keras_nlp
import tensorflow as tf

# Download and extract the SimpleBooks dataset.
tf.keras.utils.get_file(origin="https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip", extract=True)
data_dir = os.path.expanduser("~/.keras/datasets/simplebooks/")

# Training set: drop lines shorter than MIN_STRING_LEN, then batch and shuffle; validation set: filter and batch only.
raw_train_ds = (tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/train.txt").filter(lambda x: tf.strings.length(x) > MIN_STRING_LEN).batch(BATCH_SIZE).shuffle(buffer_size=BATCH_SIZE * 4))
raw_val_ds = tf.data.TextLineDataset(data_dir + "simplebooks-92-raw/valid.txt").filter(lambda x: tf.strings.length(x) > MIN_STRING_LEN).batch(BATCH_SIZE)

# Build a WordPiece vocabulary from the training data, then set up the tokenizer and the [BOS] packer.
vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(raw_train_ds, vocabulary_size=VOCAB_SIZE, lowercase=True, reserved_tokens=["[PAD]", "[UNK]", "[BOS]"])
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab, sequence_length=SEQ_LEN, lowercase=True)
start_packer = keras_nlp.layers.StartEndPacker(sequence_length=SEQ_LEN, start_value=tokenizer.token_to_id("[BOS]"))

def preprocess(inputs):
    # Features are the tokens with [BOS] prepended; labels are the unshifted tokens, so each position predicts the next token.
    outputs = tokenizer(inputs)
    features = start_packer(outputs)
    labels = outputs
    return features, labels

train_ds = raw_train_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)
val_ds = raw_val_ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

Model Training

  1. Use tf.keras.layers.Input to define the input layer; the input is a sequence of integers, i.e. the token IDs of the text.
  2. Use keras_nlp.layers.TokenAndPositionEmbedding to create an embedding layer that maps the integer sequence into a high-dimensional space. It takes the vocabulary size, sequence length, and embedding dimension, and masks the padding token in the sequence.
  3. Stack NUM_LAYERS keras_nlp.layers.TransformerDecoder layers in a loop, each with the specified number of heads num_heads and intermediate dimension intermediate_dim.
  4. Use a fully connected Dense layer to produce the final output, with the vocabulary size as the output dimension.
  5. Use tf.keras.Model to define the whole model, specifying its inputs and outputs. Use SparseCategoricalCrossentropy as the loss, Adam as the optimizer, and KerasNLP's Perplexity metric for evaluation.
  6. Call model.fit to train, passing in the training and validation sets and the number of epochs.
# The inputs are variable-length sequences of token IDs.
inputs = tf.keras.layers.Input(shape=(None,), dtype="int32")
# Token + position embedding; mask_zero=True masks the [PAD] token (ID 0).
embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(vocabulary_size=VOCAB_SIZE, sequence_length=SEQ_LEN, embedding_dim=EMBED_DIM, mask_zero=True)
x = embedding_layer(inputs)
# Stack NUM_LAYERS decoder layers; with no cross-attention input, each applies causal self-attention.
for _ in range(NUM_LAYERS):
    decoder_layer = keras_nlp.layers.TransformerDecoder(num_heads=NUM_HEADS, intermediate_dim=FEED_FORWARD_DIM)
    x = decoder_layer(x)
# Project back to the vocabulary to obtain next-token logits.
outputs = tf.keras.layers.Dense(VOCAB_SIZE)(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
perplexity = keras_nlp.metrics.Perplexity(from_logits=True, mask_token_id=0)
model.compile(optimizer="adam", loss=loss_fn, metrics=[perplexity])
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS)

Training log:

Epoch 1/5
2445/2445 [==============================] - 82s 30ms/step - loss: 4.7067 - perplexity: 111.3245 - val_loss: 4.1930 - val_perplexity: 67.9194
Epoch 2/5
2445/2445 [==============================] - 77s 30ms/step - loss: 4.1393 - perplexity: 63.0901 - val_loss: 4.0543 - val_perplexity: 59.1305
Epoch 3/5
2445/2445 [==============================] - 94s 35ms/step - loss: 4.0120 - perplexity: 55.5379 - val_loss: 4.0131 - val_perplexity: 56.7052
Epoch 4/5
2445/2445 [==============================] - 93s 35ms/step - loss: 3.9460 - perplexity: 51.9870 - val_loss: 3.9597 - val_perplexity: 53.7640
Epoch 5/5
2445/2445 [==============================] - 96s 36ms/step - loss: 3.9003 - perplexity: 49.6596 - val_loss: 3.9651 - val_perplexity: 54.0120

Generation Results with Different Sampling Algorithms

GreedySampler

Greedy search is a locally optimal strategy: at each step it outputs the token with the highest probability, and these tokens are concatenated into the final sentence.
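The samples below are produced by driving the trained model through KerasNLP's sampler API. As a minimal sketch (not part of the original post), assuming the model, tokenizer, and start_packer defined above, a prompt of just the [BOS] token and an illustrative next callback can be passed to keras_nlp.samplers.GreedySampler:

# Prompt containing only [BOS]; the packer pads it out to SEQ_LEN.
prompt_tokens = start_packer(tokenizer([""]))

def next(prompt, cache, index):
    # Logits for the position being generated; hidden states are only needed
    # for contrastive search, so None is returned in their place.
    logits = model(prompt)[:, index - 1, :]
    return logits, None, cache

sampler = keras_nlp.samplers.GreedySampler()
output_tokens = sampler(
    next=next,
    prompt=prompt_tokens,
    index=1,  # Start sampling immediately after [BOS].
)
print("Greedy search generated text:", tokenizer.detokenize(output_tokens))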

Greedy search generated text:
[b'[BOS] " i am glad to see you , " said the captain , " and i am glad to see you , for i have been so glad to see you , and i have been so glad to see you . i have been thinking of that i have been so much of you , and i have been so glad to see you , and i have been so glad to see you . i have been thinking of you , and i have been thinking of you , and i have been thinking of you , and i have been thinking of you , and i have been thinking of you , and i have been thinking of you , and i have been']

BeamSampler

Beam search keeps the beam_size highest-probability candidate paths at every time step. If the vocabulary size is v and the beam size is k, each step expands k*v candidate paths, of which the k most probable are kept and carried into the next step. With k=1, beam search reduces to greedy search.
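The same illustrative generation loop works with keras_nlp.samplers.BeamSampler; this sketch reuses the next callback and prompt_tokens from the greedy example above, and num_beams=10 is an assumed example value:

sampler = keras_nlp.samplers.BeamSampler(num_beams=10)
output_tokens = sampler(next=next, prompt=prompt_tokens, index=1)
print("Beam search generated text:", tokenizer.detokenize(output_tokens))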

Beam search generated text: 
[b'[BOS] " i don \' t think of it , " he said . " i don \' t want to know that , but i don \' t know what to do . i don \' t know what to do , but i don \' t know what to do . i don \' t know what to do , but i don \' t know what i want to do . i don \' t know what to do . i don \' t know what i want to do . i don \' t know what to do . i don \' t want to know what to do . i don \' t know what to do , but i don \' t']

RandomSampler

Random sampling draws the next token from the model's full probability distribution over all tokens at each step, so each token is chosen with a chance equal to its predicted probability.
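A corresponding sketch with keras_nlp.samplers.RandomSampler, again reusing the illustrative next callback and prompt_tokens from the greedy example:

sampler = keras_nlp.samplers.RandomSampler()
output_tokens = sampler(next=next, prompt=prompt_tokens, index=1)
print("Random search generated text:", tokenizer.detokenize(output_tokens))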

Random  search generated text: 
[b"[BOS] and on the porch he could not get off , for the lamp she had been sitting on a table , mounted on the crest of griday , and , watcher , with their nose and auber pocket in humoniation , and blue indignant himself heartily , had no particular reason upon going in the first place . and on account of his own helping tom so . he knew the whole art of charleston , and had no pain of pains to follow his brother ' s boots , that one knew that such an accident was happening away from the schoolmaster . [PAD] ! whitevinne came across the"]

TopKSampler

Top-K sampling keeps only the K tokens with the highest probabilities, renormalizes them with softmax, and samples the next token from this reduced distribution. Temperature sampling can also be combined with it.
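A sketch with keras_nlp.samplers.TopKSampler, reusing the same illustrative helpers; k=10 is an assumed example value:

sampler = keras_nlp.samplers.TopKSampler(k=10)
output_tokens = sampler(next=next, prompt=prompt_tokens, index=1)
print("Top-K search generated text:", tokenizer.detokenize(output_tokens))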

Topk search generated text: 
[b'[BOS] but the spaniard of the day , with a corporal , came to the conclusion that the latter was the case . it was to be affected to recall . the superintender was to be adorned . in this case the acres of colony had been the case with the abroachus of praunctual , and the breach of the picycle - - that the thoroughfare was not in the chateau , and the superintendent was now a']

TopPSampler

Top-P (nucleus) sampling fixes a probability threshold p in advance, sorts the candidate tokens by probability from high to low, and keeps adding tokens until their cumulative probability reaches p; the next token is then sampled from this selected set. Temperature sampling can likewise be applied.
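A sketch with keras_nlp.samplers.TopPSampler, reusing the same illustrative helpers; p=0.5 is an assumed example threshold:

sampler = keras_nlp.samplers.TopPSampler(p=0.5)
output_tokens = sampler(next=next, prompt=prompt_tokens, index=1)
print("Top-P search generated text:", tokenizer.detokenize(output_tokens))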

Topp search generated text: 
[b"[BOS] the general , of the enemy , and , with a small party of sailors , was a stout fellow - master , who was of the officers of the prussians , and his men were killed , as well as the major of the party of the prisoners , the french had gathered round their own . he was not the only skimos , but the french was at the same time . they had now returned to the admiral ' s portuguese , and , in the meantime , the english general , with his commander , had sent to the spanish admiral to pay the respects to the admiral . [PAD] had they been in command of the"]