HuggingFace Study Notes

BertTokenizer: tokenization and encoding

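Five characters produce seven ids because a [CLS] token is added at the head and a [SEP] token at the tail. [CLS] and [SEP] are two special tokens in BERT that play dedicated roles in the input text.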

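[CLS] is short for "classification". It is placed at the start of the input sequence. It is not one of the words of the text: it is an extra token, and in classification tasks its final hidden vector is typically used as the representation of the whole sequence and fed to the classifier that predicts the class.
[SEP] is short for "separator". It marks the end of a sentence or segment and is used to separate sentences. For example, when BERT processes a sentence pair, a [SEP] is inserted between the two sentences to mark the boundary, and another one closes the input.
[UNK] stands for an unknown token, i.e. one that is not in the vocabulary.
[MASK] is used to hide some of the words in a sentence. After a word is replaced by [MASK], the vector that BERT outputs at the [MASK] position is used to predict what the word was.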
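
To make the "five characters, seven ids" count concrete, here is a minimal pure-Python sketch. It is not how the library is implemented: `TOY_VOCAB` and `toy_encode` are hypothetical names, and the vocabulary contains only the ids quoted in these notes.

```python
# Toy vocabulary: only the ids that appear in these notes (bert-base-chinese).
TOY_VOCAB = {
    '[PAD]': 0, '[UNK]': 100, '[CLS]': 101, '[SEP]': 102, '[MASK]': 103,
    '北': 1266, '京': 776, '欢': 3614, '迎': 6816, '你': 872,
}

def toy_encode(text):
    """Hypothetical helper: one token per character, wrapped in [CLS] ... [SEP]."""
    tokens = ['[CLS]'] + list(text) + ['[SEP]']
    # Characters missing from the toy vocabulary fall back to [UNK].
    return [TOY_VOCAB.get(tok, TOY_VOCAB['[UNK]']) for tok in tokens]

ids = toy_encode('北京欢迎你')
print(ids)       # [101, 1266, 776, 3614, 6816, 872, 102]
print(len(ids))  # 7 = 5 characters + [CLS] + [SEP]
```

The real tokenizer also does text normalization and WordPiece splitting, which this sketch leaves out on purpose.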

Single-sentence tokenization:


from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('./huggingface/bert-base-chinese')
# Tokenize and encode
token = tokenizer.encode('北京欢迎你')
print(token)
# [101, 1266, 776, 3614, 6816, 872, 102]

# Shorthand: call the tokenizer directly (also accepts a batch)
token = tokenizer(['北京欢迎你', '为你开天辟地'], padding=True, return_tensors='pt')
print(token)
# {'input_ids': tensor([[ 101, 1266,  776, 3614, 6816,  872,  102,    0],
#         [ 101,  711,  872, 2458, 1921, 6792, 1765,  102]]),
#  'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
#         [0, 0, 0, 0, 0, 0, 0, 0]]),
#  'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1]])}

# Decode
print(tokenizer.decode([101, 1266, 776, 3614, 6816, 872, 102]))

# Show the special tokens
print(tokenizer.special_tokens_map)

# Look up the ids of the special tokens
print(tokenizer.encode(['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'], add_special_tokens=False))
# [100, 102, 0, 101, 103]

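Batch processing

With padding=True, shorter sequences are padded to the length of the longest one in the batch. return_tensors='pt' then converts the result into PyTorch tensors, which can be fed to the model as input.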


# Pad every sequence to the same length
batch_token1 = tokenizer(['北京欢迎你', '为你开天辟地'], padding=True, return_tensors='pt')
print(batch_token1)
print(batch_token1['input_ids'])

{'input_ids': tensor([[ 101, 1266,  776, 3614, 6816,  872,  102,    0],
        [ 101,  711,  872, 2458, 1921, 6792, 1765,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1]])}
tensor([[ 101, 1266,  776, 3614, 6816,  872,  102,    0],
        [ 101,  711,  872, 2458, 1921, 6792, 1765,  102]])

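Here, token_type_ids indicates which sentence each token belongs to. For example, consider these two sentences. Sentence 1: "What is the weather like today?" Sentence 2: "Will it rain later?"

After preprocessing, the two sentences might be encoded as the following tokens: [CLS], What, is, the, weather, like, today, ?, [SEP], Will, it, rain, later, ?, [SEP]

In this example, [CLS] and [SEP] are special tokens: [CLS] marks the start of the input and [SEP] separates the two sentences. The corresponding token_type_ids list has one value per token (15 in total), with 0 for the first sentence up to and including its [SEP], and 1 for the second: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]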
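
The layout can be reproduced with a few lines of plain Python. This is only a sketch of the convention, assuming the tokens are already split as in the example; `toy_pair_inputs` is a hypothetical helper, not a transformers function.

```python
def toy_pair_inputs(tokens_a, tokens_b):
    """Hypothetical helper: lay out a sentence pair the way BERT expects."""
    first = ['[CLS]'] + tokens_a + ['[SEP]']   # segment 0, including its [SEP]
    second = tokens_b + ['[SEP]']              # segment 1, including the final [SEP]
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids

tokens, token_type_ids = toy_pair_inputs(
    ['What', 'is', 'the', 'weather', 'like', 'today', '?'],
    ['Will', 'it', 'rain', 'later', '?'],
)
print(len(tokens))     # 15
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

With the real tokenizer, passing two texts, as in tokenizer(sentence_1, sentence_2), should yield token_type_ids with the same structure.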

When a sentence is too long, it can be truncated. The example below truncates each sequence to 5 tokens, counting [CLS] (101) and [SEP] (102).


# Truncate
batch_token2 = tokenizer(['北京欢迎你', '为你开天辟地'], max_length=5, truncation=True)
print(batch_token2)

{'input_ids': [[101, 1266, 776, 3614, 102], [101, 711, 872, 2458, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

To pad to a fixed length, pass padding='max_length'; every sequence is then padded up to max_length.


# Pad to the given length; anything longer is truncated
batch_token3 = tokenizer(['北京欢迎你', '为你开天辟地'], max_length=10, truncation=True, padding='max_length')
print(batch_token3)

{'input_ids': [[101, 1266, 776, 3614, 6816, 872, 102, 0, 0, 0], [101, 711, 872, 2458, 1921, 6792, 1765, 102, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]}
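
The three padding and truncation modes can be imitated in plain Python, which may help when reading the outputs. This is a sketch under stated assumptions, not the tokenizer's actual code: `toy_pad_and_truncate` is a hypothetical helper, and it assumes every input sequence already ends with [SEP] (102).

```python
PAD_ID, SEP_ID = 0, 102

def toy_pad_and_truncate(batch, max_length=None):
    """Hypothetical helper imitating truncation=True combined with padding.

    max_length=None pads to the longest sequence (like padding=True);
    otherwise sequences are truncated and padded to max_length.
    """
    target = max_length or max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        if len(ids) > target:
            # Cut the content but keep a closing [SEP].
            ids = ids[:target - 1] + [SEP_ID]
        pad = target - len(ids)
        input_ids.append(ids + [PAD_ID] * pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = [
    [101, 1266, 776, 3614, 6816, 872, 102],        # 北京欢迎你
    [101, 711, 872, 2458, 1921, 6792, 1765, 102],  # 为你开天辟地
]
print(toy_pad_and_truncate(batch))                 # pad to the longest (8)
print(toy_pad_and_truncate(batch, max_length=5))   # truncate to 5
print(toy_pad_and_truncate(batch, max_length=10))  # pad to 10
```

The printed input_ids and attention_mask lists match the values in the three tokenizer outputs shown in these notes.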

Encoding into word vectors


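from transformers import BertModel
from transformers import logging
# Loading the model emits some warnings; silence them so they don't get in the way of debugging
logging.set_verbosity_error()

model = BertModel.from_pretrained('./huggingface/bert-base-chinese')
# Only input_ids are passed here, so the padded position is not masked out;
# model(**batch_token1) would also pass attention_mask and token_type_ids.
encoded = model(batch_token1['input_ids'])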
print(encoded)

encoded_text = encoded[0]
print(encoded_text.shape)

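last_hidden_state holds the hidden states of the last layer, i.e. the contextual vector of each token. pooler_output is the pooled [CLS] representation intended for classification. The remaining fields are optional states, which are None here. What we need is the first item (the encoding itself), encoded[0]. Its shape (2, 8, 768) means: 2 sentences (a batch of 2), each padded to 8 tokens, each token encoded by BERT as a 768-dimensional vector.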


BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.2815,  0.6079,  0.3920,  ...,  0.4682,  0.1664, -0.1104],
         [-0.4996,  0.4137,  0.4482,  ..., -0.5986, -0.3632, -0.0424],
         [ 0.0472,  0.4009, -0.2222,  ...,  0.1105,  0.5548, -0.1777],
         ...,
         [ 0.6923,  0.5521, -0.2580,  ...,  0.0042,  0.4254, -0.6365],
         [-0.3318,  0.3553,  0.4314,  ...,  0.0181, -0.1999, -0.2506],
         [-0.2052,  0.2994, -0.0189,  ..., -0.0735, -0.3766, -0.4286]],

        [[ 0.2887,  0.6017,  0.4943,  ...,  0.0903,  0.0543, -0.1163],
         [-0.2048,  0.5193,  0.9473,  ..., -0.8814, -0.5178,  0.1631],
         [ 0.7151,  0.0340, -0.4089,  ..., -0.2059, -0.1003, -0.5724],
         ...,
         [ 0.6159, -0.1950,  0.9022,  ..., -0.5146,  0.6748, -1.2145],
         [ 1.2560,  0.3676,  0.1448,  ..., -0.3056,  0.2488,  0.1433],
         [-0.2018,  0.2100,  0.3642,  ..., -0.7199,  0.0571, -0.2698]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 0.9986,  1.0000,  0.9751,  ..., -0.9984, -0.9446,  0.9237],
        [ 0.9998,  1.0000,  0.9957,  ..., -0.9992, -0.9960,  0.9914]],
       grad_fn=<TanhBackward0>), hidden_states=None, past_key_values=None, attentions=None, cross_attentions=None)
torch.Size([2, 8, 768])

The BertTokenizer irreversibility problem

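A Chinese-only model such as bert-base-chinese has little coverage of English words. If mixed Chinese-English text is encoded directly with the Chinese model, English parts that are not in the vocabulary become [UNK] tokens, which makes BertTokenizer tokenization irreversible.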

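The fix is to use offset_mapping, which records where each token sits in the original text, and recover the content from the original text.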


from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('./huggingface/bert-base-chinese')

text = '周杰伦Jay专辑'

print(tokenizer.tokenize(text))

tokenize() returns the tokens after splitting: ['周', '杰', '伦', '[UNK]', '专', '辑']


tokened = tokenizer(text)
input_ids = tokened['input_ids']
print(input_ids)

After conversion to ids: [101, 1453, 3345, 840, 100, 683, 6782, 102]


# subject entity
sub_pos = [1, 3]  # 周杰伦
# Keep the ids whose position k lies in [sub_pos[0], sub_pos[1]]
sub_ids = [tok_id for k, tok_id in enumerate(input_ids) if sub_pos[0] <= k <= sub_pos[1]]
print(sub_ids)
sub_text = tokenizer.decode(sub_ids).replace(' ', '')
print(sub_text)

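While iterating over input_ids, the comprehension keeps the ids at positions 1 <= k <= 3. The replace(' ', '') call is needed because decode joins the tokens with spaces, giving "周 杰 伦".

[1453, 3345, 840]
周杰伦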


# object entity
obj_pos = [4, 4]  # Jay
# Keep the ids whose position k lies in [obj_pos[0], obj_pos[1]]
obj_ids = [tok_id for k, tok_id in enumerate(input_ids) if obj_pos[0] <= k <= obj_pos[1]]
print(obj_ids)
obj_text = tokenizer.decode(obj_ids).replace(' ', '')
print(obj_text)

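Once the text has been encoded into ids, decoding the ids no longer recovers the original content. This loss of information is the problem to solve:

[100]
[UNK]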
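
Why the round trip fails can be shown with a toy vocabulary. The sketch below is not the library's code: `TOY_VOCAB`, `ID_TO_TOKEN` and `toy_round_trip` are hypothetical names, the ids are the ones printed above, and "Jay" is assumed to be a single out-of-vocabulary token.

```python
# Toy vocabulary holding only the ids printed above; 'Jay' is deliberately absent.
TOY_VOCAB = {'[UNK]': 100, '周': 1453, '杰': 3345, '伦': 840, '专': 683, '辑': 6782}
ID_TO_TOKEN = {idx: tok for tok, idx in TOY_VOCAB.items()}

def toy_round_trip(tokens):
    """Hypothetical helper: encode tokens to ids, then decode the ids back."""
    # Any token outside the vocabulary collapses to the single [UNK] id.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB['[UNK]']) for tok in tokens]
    decoded = ''.join(ID_TO_TOKEN[i] for i in ids)
    return ids, decoded

ids, decoded = toy_round_trip(['周', '杰', '伦', 'Jay', '专', '辑'])
print(ids)      # [1453, 3345, 840, 100, 683, 6782]
print(decoded)  # 周杰伦[UNK]专辑
```

Every unknown token maps to the same id 100, so nothing in the ids says which characters were there originally.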

Finding entities in the original text


from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('./huggingface/bert-base-chinese')
text = '周杰伦Jay专辑'

tokened = tokenizer(text, return_offsets_mapping=True)  # the second argument also returns each token's position in the text
print(tokened)
offset_mapping = tokened['offset_mapping']
head, tail = offset_mapping[4]
print(text[head:tail])

The result has an extra offset_mapping field, which gives the character span of each token in the original text.

{'input_ids': [101, 1453, 3345, 840, 100, 683, 6782, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1],
 'offset_mapping': [(0, 0), (0, 1), (1, 2), (2, 3), (3, 6), (6, 7), (7, 8), (0, 0)]}
Jay
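
The same lookup extends to entities that span several tokens. Below is a small sketch that works on the offsets printed above without loading a tokenizer; `span_text` is a hypothetical helper, not part of transformers.

```python
def span_text(text, offset_mapping, first_token, last_token):
    """Hypothetical helper: original text covered by tokens first..last (inclusive)."""
    start = offset_mapping[first_token][0]  # start of the first token
    end = offset_mapping[last_token][1]     # end of the last token
    return text[start:end]

text = '周杰伦Jay专辑'
# Offsets copied from the output above; (0, 0) belongs to [CLS] and [SEP].
offset_mapping = [(0, 0), (0, 1), (1, 2), (2, 3), (3, 6), (6, 7), (7, 8), (0, 0)]

print(span_text(text, offset_mapping, 1, 3))  # 周杰伦
print(span_text(text, offset_mapping, 4, 4))  # Jay
```

Because the text is sliced from the original string rather than decoded from ids, the [UNK] token at position 4 still yields "Jay".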
