【HuggingFace LLM】Data Flow in Classic NLP Tasks

Creating a Transformer

python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

The from_pretrained method downloads the model data from the Hugging Face Hub and caches it. As mentioned earlier, the checkpoint name corresponds to a specific model architecture and set of weights; in this case it is a BERT model with the base architecture (12 layers, 768 hidden size, 12 attention heads) and cased input (meaning the upper/lowercase distinction matters).
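
If you want to verify those numbers, the checkpoint's configuration can be loaded on its own (a minimal sketch using AutoConfig; the attribute names are standard config fields):

python
from transformers import AutoConfig

# download only the configuration, not the weights
config = AutoConfig.from_pretrained("bert-base-cased")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# 12 768 12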

Loading and Saving

python
model.save_pretrained("directory_on_my_computer")

This saves two files to the local directory: config.json and model.safetensors.

Where:

  • config.json holds all the attributes needed to rebuild the model architecture;
  • model.safetensors is a state dict containing all of the model's weights;
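
As a quick check, the saved directory can be reloaded with the same from_pretrained API (a small sketch reusing the directory name from above):

python
import os

from transformers import AutoModel

# rebuild the architecture from config.json and load the weights
model = AutoModel.from_pretrained("directory_on_my_computer")
print(sorted(os.listdir("directory_on_my_computer")))  # config.json, model.safetensors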

Text Encoding

To encode text, you pass it through a tokenizer, which turns a piece of text into the corresponding list of numbers (token IDs).

python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> encoded_input

{'input_ids': [101, 8667, 117, 1000, 1045, 1005, 1049, 2235, 17662, 12172, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here input_ids are the numeric IDs of the tokens, token_type_ids indicate which sentence each token comes from, and attention_mask tells the model which tokens it should attend to.
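
To see how the IDs line up with the tokens, they can be mapped back one by one (a sketch reusing the encoded_input from above):

python
# map each id back to its string token to inspect the alignment
tokens = tokenizer.convert_ids_to_tokens(encoded_input["input_ids"])
for token, token_id in zip(tokens, encoded_input["input_ids"]):
    print(token, token_id)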

You can also use tokenizer.decode to turn a list of token IDs back into the original text:

python
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"

Tokens like [CLS] and [SEP] are called special tokens. If the model used them during pretraining, the tokenizer needs to add them; for models pretrained without them, they can be left out.
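
If they are not wanted, the tokenizer call can skip them explicitly (a sketch; add_special_tokens is a standard argument of the tokenizer call):

python
# encode without the [CLS]/[SEP] wrapper
no_special = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
print(tokenizer.decode(no_special["input_ids"]))  # decodes without special tokens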

You can also encode several sentences at once by passing them as a list (a batch):

python
# pass a list of sentences to encode them as a batch
# (two positional strings would instead be treated as a sentence pair)
encoded_input = tokenizer(["How are you?", "I'm fine, thank you!"])
print(encoded_input)
{'input_ids': [[101, 1731, 1132, 1128, 136, 102], [101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
 
python
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], return_tensors="pt"
)
# this call raises an error: the two sentences have different lengths
# and cannot be stacked into a single tensor without padding

When return_tensors="pt" is requested, every sequence in the batch must have the same length so they can be stacked into one tensor. Since the sentences differ in length, the padding method is needed to bring them up to a common length.

Padding

python
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   136,   102,     0,     0,     0,     0],
         [  101,  1045,  1005,  1049,  2503,   117,  5763,  1128,   136,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

The shorter sentence is padded with 0s, and the attention_mask marks those positions with 0 to indicate that they are padding and carry no actual meaning.
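
The padding value itself is not arbitrary; each tokenizer exposes its pad token and the matching ID:

python
print(tokenizer.pad_token, tokenizer.pad_token_id)  # [PAD] 0 for bert-base-cased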

If you are building tensors by hand from several input_ids lists, you can use tokenizer.pad_token_id as the fill value:

python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# `checkpoint` is not defined earlier in this post; this is the sentiment
# checkpoint from the Hugging Face course that yields the logits shown below
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

Notice that the sequence with the padding value gives a different result when batched. This is because the attention layers take every token into account, including the padding; you need an attention_mask to tell the attention layers to ignore those tokens.

python
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)

With the attention_mask added, the batched run gives the same result as feeding each sequence on its own.
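
Rather than writing the mask by hand, the tokenizer's pad method can pad pre-computed input_ids and build the matching mask in one step (a small sketch):

python
# pad ragged input_ids and generate the attention_mask automatically
batch = tokenizer.pad({"input_ids": [[200, 200, 200], [200, 200]]}, return_tensors="pt")
print(batch["input_ids"])
print(batch["attention_mask"])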

Truncating

When a sequence is longer than the maximum sequence length the model was pretrained with, the model cannot process it, and the input has to be truncated.

python
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)

Setting truncation=True truncates inputs that exceed the model's maximum length (or max_length, if one is passed).
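
The limit being enforced is exposed on the tokenizer itself (model_max_length, 512 for BERT-style checkpoints):

python
print(tokenizer.model_max_length)        # 512 for BERT-style checkpoints
print(len(encoded_input["input_ids"]))   # at most 512 after truncation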

You can combine the padding and truncation arguments to control exactly what the output tensors look like:

python
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   102],
         [  101,  1045,  1005,  1049,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]])}

Special Tokens

python
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."

If you manually tokenize the text and then manually convert the tokens to IDs, special tokens such as [CLS]/[SEP] are not added automatically. In BERT-family models these special tokens were used during pretraining, so you need to add them yourself.
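
One way to add them by hand is the tokenizer's build_inputs_with_special_tokens helper (a sketch reusing the ids computed above):

python
# wrap the raw ids with the checkpoint's special tokens
ids_with_special = tokenizer.build_inputs_with_special_tokens(ids)
print(tokenizer.decode(ids_with_special))  # now includes [CLS] ... [SEP]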

If you instead call the tokenizer directly, as with tokenizer(sequence) above, it knows from the checkpoint which special tokens are expected and adds them automatically.

Encoding API

python
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)

First, it can tokenize a single sequence:

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)

It can also handle multiple sequences at a time, with no change in the API:

python
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

It can pad according to several targets: the longest sequence in the batch, the model's maximum length, or a specified maximum length.

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

It can also truncate sequences, and truncation can be combined with an explicit maximum length.

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

Finally, the API lets you choose the format of the returned data: PyTorch tensors (return_tensors="pt") or NumPy arrays (return_tensors="np").
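
Putting the pieces together, the batch returned by the tokenizer can be fed straight into the model (a sketch reusing the checkpoint, tokenizer, and model loaded earlier):

python
# tokenize, pad, truncate, and run the model in one pass
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output.logits.shape)  # torch.Size([2, 2]): one row per sentence, one logit per label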
