Creating a Transformer
python
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-cased")
The from_pretrained method downloads the model data from the Hugging Face Hub and caches it. As mentioned earlier, the checkpoint name corresponds to a specific model architecture and set of weights; in this case, a BERT model with the base architecture (12 layers, hidden size 768, 12 attention heads) and cased input (meaning the distinction between upper and lower case matters).
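The architecture details quoted above can be checked directly on the model's config, which mirrors config.json (a quick sanity check using the model just loaded):
python
# Inspect the architecture attributes of bert-base-cased
print(model.config.num_hidden_layers)    # 12
print(model.config.hidden_size)          # 768
print(model.config.num_attention_heads)  # 12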
Loading and Saving
python
model.save_pretrained("directory_on_my_computer")
This saves two files locally, config.json and model.safetensors, where:
config.json holds all the attributes needed to rebuild the model architecture; model.safetensors is a state dict containing all of the model's weights.
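A saved directory can be loaded back exactly like a Hub checkpoint (a minimal sketch, reusing the directory name from the example above):
python
from transformers import AutoModel

# from_pretrained reads config.json to rebuild the architecture
# and model.safetensors to restore the weights
model = AutoModel.from_pretrained("directory_on_my_computer")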
Text Encoding
Text is encoded by passing it through a tokenizer, which turns a piece of text into the corresponding list of token IDs.
python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> encoded_input
{'input_ids': [101, 8667, 117, 1000, 1045, 1005, 1049, 2235, 17662, 12172, 1012, 102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Here input_ids are the numeric IDs of the tokens, token_type_ids indicate which sentence each token belongs to, and attention_mask indicates which tokens the model should attend to.
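The token_type_ids only become informative when a sentence pair is passed as two positional arguments, which is different from batching two sentences (shown further below). A small sketch with the same tokenizer:
python
# Two positional strings are encoded as one sentence pair:
# tokens of the first segment get type 0, those of the second get type 1
pair = tokenizer("How are you?", "I'm fine, thank you!")
print(pair["token_type_ids"])  # e.g. [0, 0, 0, 0, 0, 0, 1, 1, ...]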
A list of token IDs can also be decoded back into the original text with tokenizer.decode.
python
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"
Tokens like [CLS] and [SEP] are called special tokens. If the model was pretrained with these special tokens, the tokenizer needs to add them; otherwise they can be left out.
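Whether they are added can also be controlled explicitly via the add_special_tokens flag, which defaults to True:
python
# Skip [CLS]/[SEP], e.g. for a model pretrained without them
no_special = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
print(tokenizer.decode(no_special["input_ids"]))
# "Hello, I'm a single sentence!"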
Multiple sentences can also be batched together and encoded in one call:
python
encoded_input = tokenizer("How are you?", "I'm fine, thank you!")
print(encoded_input)
{'input_ids': [[101, 1731, 1132, 1128, 136, 102], [101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]],
'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
encoded_input = tokenizer("How are you?", "I'm fine, thank you!", return_tensors="pt")
print(encoded_input)
{'input_ids': tensor([[ 101, 1731, 1132, 1128, 136, 102],
[ 101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
当返回return="pt"时,由于不同句子之间的字数不同,需要用到padding方法补充至相同长度。
Padding
python
encoded_input = tokenizer(
["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)
{'input_ids': tensor([[ 101, 1731, 1132, 1128, 136, 102, 0, 0, 0, 0],
[ 101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Shorter sequences are padded with 0s, and the attention_mask marks these positions with 0 to indicate that they are padding and carry no real meaning.
If you build tensors by hand from several input_ids lists, you can pad them yourself with tokenizer.pad_token_id:
python
import torch
from transformers import AutoModelForSequenceClassification

# Assumed checkpoint: the sentiment-analysis model whose logits are shown below
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)
tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
[ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
Notice that the padded sequence produces different logits when batched. This is because the attention layers take every token into account, padding included, so an attention_mask is needed to tell the attention layers to ignore those tokens.
python
batched_ids = [
[200, 200, 200],
[200, 200, tokenizer.pad_token_id],
]
attention_mask = [
[1, 1, 1],
[1, 1, 0],
]
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)
tensor([[ 1.5694, -1.3895],
[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
With the attention_mask added, the batch gives the same results as passing each sequence on its own.
Truncating
When an input exceeds the maximum sequence length the model was pretrained with, the model cannot process it, so the input has to be truncated.
python
encoded_input = tokenizer(
"This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
truncation=True,
)
Setting truncation=True truncates the input to the model's maximum length.
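The limit being enforced is exposed on the tokenizer itself, so the effect is easy to verify (reusing the tokenizer and encoded_input from above):
python
# Maximum sequence length accepted by the associated model (512 for BERT)
print(tokenizer.model_max_length)
print(len(encoded_input["input_ids"]) <= tokenizer.model_max_length)  # True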
The padding and truncation arguments can be combined to control the exact shape of the output tensors.
python
encoded_input = tokenizer(
["How are you?", "I'm fine, thank you!"],
padding=True,
truncation=True,
max_length=5,
return_tensors="pt",
)
print(encoded_input)
{'input_ids': tensor([[ 101, 1731, 1132, 1128, 102],
[ 101, 1045, 1005, 1049, 102]]),
'token_type_ids': tensor([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]]),
'attention_mask': tensor([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1]])}
Special Tokens
python
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))
"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."
If you tokenize the text manually with tokenizer.tokenize() and then convert the tokens to IDs manually with convert_tokens_to_ids(), special tokens such as [CLS] and [SEP] are not added. Models in the BERT family were pretrained with these special tokens, so you would need to add them yourself.
When the tokenizer is called directly instead, it knows whether the checkpoint expects special tokens and adds them automatically.
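For the manual route, the tokenizer also provides a helper that wraps raw IDs with the checkpoint's special tokens; a sketch reusing ids and model_inputs from above:
python
# prepare_for_model adds the special tokens the checkpoint expects
ids_with_special = tokenizer.prepare_for_model(ids)["input_ids"]
print(ids_with_special == model_inputs["input_ids"])  # True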
The Encoding API
python
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)
First, it can tokenize a single sequence.
python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)
It can also handle multiple sequences at once, with no change to the API.
python
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")
# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")
# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
It can also pad according to several strategies.
python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)
# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
It can also truncate sequences, and truncation can be combined with a maximum length.
python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
Finally, the API can specify the format of the returned arrays: PyTorch tensors or NumPy arrays.
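Putting it all together: the tokenizer's output dict maps directly onto the model's keyword arguments, so an encoded batch can be unpacked straight into the forward call (a minimal sketch, reusing sequences from above and the sequence classification model from the padding example):
python
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
# Keys of the dict (input_ids, attention_mask, ...) match the model's parameters
output = model(**tokens)
print(output.logits)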