Word-Level Language Modeling with RNN and Transformer: Code Analysis of Dataset Processing with Dictionary and Corpus

flyfish

Word-level Language Modeling using RNN and Transformer

word_language_model

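The word_language_model example provided by PyTorch shows how to do word-level language modeling with a recurrent neural network (RNN, either GRU or LSTM) or a Transformer model. By default, training uses the Wikitext-2 dataset, and generate.py can use the trained model to generate new text.

Word-level: the smallest unit the language model works with is a word, not a character or a subword.

Language modeling: building a model that predicts how likely each word is to come next, given the words before it in a sequence.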
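To make "predicting the next word" concrete, here is a minimal sketch that estimates next-word probabilities from raw bigram counts. This is only an illustration of the idea; it is not how the RNN or Transformer in word_language_model works, and the toy sentence and the helper name `next_word_probs` are made up for this example.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration); '<eos>' marks the end of each line.
tokens = "the cat sat <eos> the cat ran <eos> the dog sat <eos>".split()

# Count how often each word follows each preceding word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Estimate P(next | prev) from raw bigram counts (hypothetical helper)."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

probs = next_word_probs("the")
print(probs)  # "cat" follows "the" twice, "dog" once
```

A neural language model replaces the count table with a network that conditions on a much longer context, but the output has the same shape: a probability for every word in the vocabulary.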

File: data.py

A token is a basic unit of text, such as a word, a subword, or a character.

To tokenize is to split text into these basic units.
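In data.py, tokenization is nothing more than splitting each line on whitespace and appending an end-of-sentence marker. The short sketch below shows what that does to one line; the sample sentences are made up for illustration.

```python
# Whitespace tokenization as used in data.py: split the line, then append '<eos>'.
line = "  The quick   brown fox .\n"
tokens = line.split() + ['<eos>']
print(tokens)  # ['The', 'quick', 'brown', 'fox', '.', '<eos>']

# A blank line contains no words, so it contributes only the '<eos>' marker.
blank_tokens = " \n".split() + ['<eos>']
print(blank_tokens)  # ['<eos>']
```

Because str.split() with no arguments collapses runs of whitespace and drops the trailing newline, no extra cleanup is needed. Punctuation only becomes a separate token if the text already separates it with spaces.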

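The Dictionary class stores the mapping between words and indices.

The Corpus class reads text files and converts them into index sequences for further processing and model training.

Together, the Dictionary and Corpus classes turn raw text into a format that a language model can be trained on. Specifically, the code does the following:

Building the vocabulary (the Dictionary class):

The Dictionary class stores the word-to-index mapping (word2idx) and the index-to-word mapping (idx2word).

The add_word method adds a new word to the vocabulary. If the word is not yet present, it is appended to the idx2word list and its index is recorded in the word2idx dict. In both cases the method returns the word's index.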
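The behaviour of add_word on a repeated word is easy to check directly. The sketch below reuses the Dictionary class from data.py and adds "apple" twice; the variable names are only for illustration.

```python
class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        # Only unseen words are appended; known words keep their index.
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

vocab = Dictionary()
first = vocab.add_word("apple")
second = vocab.add_word("banana")
again = vocab.add_word("apple")   # already present, so nothing is added

print(first, second, again)  # 0 1 0
print(len(vocab))            # 2
```

Indices are therefore assigned in order of first appearance, and the vocabulary size counts distinct words only.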

Processing the text data (the Corpus class):

corpus
US ['kɔːrpəs]
UK ['kɔːpəs]
n. collected writings / complete works / (anatomy) body / text corpus

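The Corpus class reads and processes the text data files and produces the datasets used for training, validation, and testing.

On initialization, Corpus reads train.txt, valid.txt, and test.txt from the given path and calls the tokenize method to convert each file into a sequence of word indices.

The tokenize method first adds every word in the file to the vocabulary, then converts the file content into the corresponding index sequence and returns those indices as a Tensor.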
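Because Corpus keeps a single Dictionary and calls tokenize on the three files in turn, a word first seen in a later file simply receives the next free index, while words seen earlier keep theirs. The sketch below illustrates this with in-memory lines and a plain dict. Note that `tokenize_lines` is a hypothetical helper, not part of data.py, and it returns a Python list where the real method returns a Tensor.

```python
def tokenize_lines(lines, word2idx):
    """Two-pass tokenization over in-memory lines (hypothetical helper)."""
    # Pass 1: give every unseen word the next free index.
    for line in lines:
        for word in line.split() + ['<eos>']:
            word2idx.setdefault(word, len(word2idx))
    # Pass 2: map each word to its index.
    return [word2idx[word] for line in lines for word in line.split() + ['<eos>']]

word2idx = {}  # shared across splits, like Corpus.dictionary
train_ids = tokenize_lines(["a b", "b c"], word2idx)
valid_ids = tokenize_lines(["c d"], word2idx)

print(train_ids)  # [0, 1, 2, 1, 3, 2]
print(valid_ids)  # [3, 4, 2]
print(word2idx)   # {'a': 0, 'b': 1, '<eos>': 2, 'c': 3, 'd': 4}
```

One consequence of this design is that the vocabulary also contains words that occur only in the validation or test file, so the lookup in the second pass never fails.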

Testing Dictionary

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

# Create a Dictionary instance
vocab = Dictionary()

# Add words to the vocabulary
vocab.add_word("apple")
vocab.add_word("banana")
vocab.add_word("orange")

# Get the vocabulary size
vocab_size = len(vocab)
print("Vocabulary size:", vocab_size)  # Output: Vocabulary size: 3

# Look up the index of each word
print("Index of 'apple':", vocab.word2idx["apple"])  # Output: Index of 'apple': 0
print("Index of 'banana':", vocab.word2idx["banana"])  # Output: Index of 'banana': 1
print("Index of 'orange':", vocab.word2idx["orange"])  # Output: Index of 'orange': 2

# Look up the word at each index
print("Word at index 0:", vocab.idx2word[0])  # Output: Word at index 0: apple
print("Word at index 1:", vocab.idx2word[1])  # Output: Word at index 1: banana
print("Word at index 2:", vocab.idx2word[2])  # Output: Word at index 2: orange

Testing Corpus

import os
import torch

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)

class Corpus(object):
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file."""
        assert os.path.exists(path)
        # Add words to the dictionary
        with open(path, 'r', encoding="utf8") as f:
            for line in f:
                words = line.split() + ['<eos>']
                for word in words:
                    self.dictionary.add_word(word)

        # Tokenize file content
        with open(path, 'r', encoding="utf8") as f:
            idss = []
            for line in f:
                words = line.split() + ['<eos>']
                ids = []
                for word in words:
                    ids.append(self.dictionary.word2idx[word])
                idss.append(torch.tensor(ids).type(torch.int64))
            ids = torch.cat(idss)

        return ids

# Usage example
corpus_path = './data/wikitext-2'  # Replace with the path to your data folder
corpus = Corpus(corpus_path)

# Print the vocabulary size and a couple of word/index lookups
print("Vocabulary size:", len(corpus.dictionary))
print("Index of '<eos>':", corpus.dictionary.word2idx['<eos>'])
print("Word at index 0:", corpus.dictionary.idx2word[0])

# Print the shapes of the training, validation, and test sets
print("Train data shape:", corpus.train.shape)
print("Validation data shape:", corpus.valid.shape)
print("Test data shape:", corpus.test.shape)


# Print the vocabulary size again
print("Vocabulary size:", len(corpus.dictionary))

# Print the first 10 words and their indices
print("First 10 words and their indices in the vocabulary:")
for i in range(10):
    word = corpus.dictionary.idx2word[i]
    idx = corpus.dictionary.word2idx[word]
    print(f"Word: {word}, Index: {idx}")

# Print the first 10 entries of the word-to-index mapping
print("\nSome word to index mappings:")
for word, idx in list(corpus.dictionary.word2idx.items())[:10]:
    print(f"Word: {word}, Index: {idx}")

Output

Vocabulary size: 33278
Index of '<eos>': 0
Word at index 0: <eos>
Train data shape: torch.Size([2088628])
Validation data shape: torch.Size([217646])
Test data shape: torch.Size([245569])
Vocabulary size: 33278
First 10 words and their indices in the vocabulary:
Word: <eos>, Index: 0
Word: =, Index: 1
Word: Valkyria, Index: 2
Word: Chronicles, Index: 3
Word: III, Index: 4
Word: Senjō, Index: 5
Word: no, Index: 6
Word: 3, Index: 7
Word: :, Index: 8
Word: <unk>, Index: 9

Some word to index mappings:
Word: <eos>, Index: 0
Word: =, Index: 1
Word: Valkyria, Index: 2
Word: Chronicles, Index: 3
Word: III, Index: 4
Word: Senjō, Index: 5
Word: no, Index: 6
Word: 3, Index: 7
Word: :, Index: 8
Word: <unk>, Index: 9

Suppose we have a simple text file train.txt with the following content:

hello world
this is a test

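The tokenize method processes this file as follows:

Building the vocabulary:

Read the first line "hello world": add "hello" and "world" to the vocabulary, followed by <eos> (which marks the end of the sentence).

Read the second line "this is a test": add "this", "is", "a", and "test" to the vocabulary. <eos> is appended again, but it is already in the vocabulary, so it keeps its index.

Assuming the dictionary was empty beforehand, the resulting vocabulary is: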

{
    'hello': 0,
    'world': 1,
    '<eos>': 2,
    'this': 3,
    'is': 4,
    'a': 5,
    'test': 6
}

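Converting the text into an index sequence:

Read the first line "hello world" and convert it to [0, 1, 2] ("hello" maps to 0, "world" to 1, and <eos> to 2).

Read the second line "this is a test" and convert it to [3, 4, 5, 6, 2] ("this" maps to 3, "is" to 4, "a" to 5, "test" to 6, and <eos> to 2).

Finally, the per-line index sequences are concatenated into a single one-dimensional Tensor, here [0, 1, 2, 3, 4, 5, 6, 2].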
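The walk-through above can be checked with a small script. This is a sketch under a few assumptions: it writes the two-line file to a temporary directory, and it uses `tokenize_to_list`, a hypothetical list-based stand-in for Corpus.tokenize, so that it runs without PyTorch. The real method returns the same indices as an int64 Tensor.

```python
import os
import tempfile

class Dictionary(object):
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

def tokenize_to_list(dictionary, path):
    """List-based stand-in for Corpus.tokenize (hypothetical, no torch needed)."""
    # First pass: add every word, plus '<eos>' per line, to the dictionary.
    with open(path, 'r', encoding="utf8") as f:
        for line in f:
            for word in line.split() + ['<eos>']:
                dictionary.add_word(word)
    # Second pass: map every word to its index.
    ids = []
    with open(path, 'r', encoding="utf8") as f:
        for line in f:
            for word in line.split() + ['<eos>']:
                ids.append(dictionary.word2idx[word])
    return ids

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.txt')
    with open(path, 'w', encoding="utf8") as f:
        f.write("hello world\nthis is a test\n")
    dictionary = Dictionary()
    ids = tokenize_to_list(dictionary, path)

print(dictionary.word2idx)
print(ids)  # [0, 1, 2, 3, 4, 5, 6, 2]

# Mapping the indices back through idx2word recovers the token stream.
decoded = ' '.join(dictionary.idx2word[i] for i in ids)
print(decoded)  # hello world <eos> this is a test <eos>
```

Decoding through idx2word is the same lookup a generation script needs when it turns predicted indices back into words.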
