[HuggingFace LLM] Data Flow in Classic NLP Tasks

Creating a Transformer

python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

The from_pretrained method downloads the model data from the Hugging Face Hub and caches it. As mentioned earlier, the checkpoint name corresponds to a specific model architecture and set of weights, in this case a BERT model with the base architecture (12 layers, 768 hidden size, 12 attention heads) and cased input (meaning the distinction between upper- and lowercase matters).
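
The architecture numbers quoted above can be verified from the checkpoint's configuration; a minimal sketch (the AutoConfig check is not part of the original example):

python
from transformers import AutoConfig

# Inspect the architecture hyperparameters of the checkpoint
config = AutoConfig.from_pretrained("bert-base-cased")
print(config.num_hidden_layers)    # 12 layers
print(config.hidden_size)          # 768 hidden size
print(config.num_attention_heads)  # 12 attention heads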

Loading and Saving

python
model.save_pretrained("directory_on_my_computer")

This saves two files locally: config.json and model.safetensors.

Where:

  • config.json holds all the attributes needed to build the model architecture;
  • model.safetensors is a state dict containing all of the model's weights;
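
Assuming the directory saved above, the model can later be reloaded from disk rather than from the Hub; a minimal sketch:

python
from transformers import AutoModel

# from_pretrained also accepts a local directory created by save_pretrained
model = AutoModel.from_pretrained("directory_on_my_computer")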

Text Encoding

To encode text, it is passed through a tokenizer, which turns a piece of text into its corresponding list of numbers.

python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 8667, 117, 1000, 1045, 1005, 1049, 2235, 17662, 12172, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here input_ids are the numbers corresponding to the tokens, token_type_ids indicate which sentence each token comes from, and attention_mask tells the model which tokens it should attend to.
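
token_type_ids only become informative when encoding a sentence pair; a minimal sketch reusing the tokenizer above (the exact ids depend on the checkpoint):

python
# Two strings passed as separate arguments are encoded as one sentence pair
pair = tokenizer("How are you?", "I'm fine, thank you!")
print(pair["token_type_ids"])  # 0s for the first sentence, 1s for the second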

A list of token ids can also be decoded back into the original sentence with tokenizer.decode:

python
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"

Tokens like [CLS] and [SEP] are called special tokens. If the model used these special tokens during pretraining, the tokenizer needs to add them; models pretrained without them can be fed input that omits them.
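
For a checkpoint that does not expect them, the special tokens can be skipped with the tokenizer's add_special_tokens flag; a minimal sketch:

python
# Encode without adding [CLS]/[SEP]
no_special = tokenizer("Hello, I'm a single sentence!", add_special_tokens=False)
print(tokenizer.decode(no_special["input_ids"]))  # no [CLS]/[SEP] in the output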

Multiple sentences can also be encoded at once by passing them in as a batch (a list of strings); note that passing two bare strings instead would encode them as a single sentence pair, as shown above.

python
encoded_input = tokenizer(["How are you?", "I'm fine, thank you!"])
print(encoded_input)
{'input_ids': [[101, 1731, 1132, 1128, 136, 102], [101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
 
encoded_input = tokenizer(["How are you?", "I'm fine, thank you!"], return_tensors="pt")
# ValueError: Unable to create tensor, you should probably activate
# truncation and/or padding with 'padding=True' 'truncation=True' ...

When return_tensors="pt" is requested, the sentences in the batch contain different numbers of tokens, so a rectangular tensor cannot be built directly; the padding method is needed to bring them to the same length.

Padding
python
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   136,   102,     0,     0,     0,     0],
         [  101,  1045,  1005,  1049,  2503,   117,  5763,  1128,   136,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

The shorter sentence is padded with 0s, and the corresponding positions in attention_mask are set to 0 to note that this part is padding with no real meaning.

When manually turning several input_ids lists into a tensor, you can pad them with tokenizer.pad_token_id:

python
import torch
from transformers import AutoModelForSequenceClassification

# `checkpoint` is not defined in this post; assuming a sequence-classification
# checkpoint such as the sentiment model that produces the logits shown below
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

Notice that the sequence with a padding value gives different logits when batched. This is because the attention layers take all tokens into account, padding included, so an attention_mask is needed to tell the attention layers to ignore those tokens.

python
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)

With the attention_mask added, the batch gives the same result as feeding each sentence on its own.

Truncating

When a sequence is too long, exceeding the maximum sequence length used during pretraining, the model cannot process it and the input must be truncated.

python
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)

Setting truncation=True truncates inputs to the model's maximum size.
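
The cutoff used when no explicit max_length is given comes from the checkpoint; a minimal sketch:

python
# The tokenizer carries the model's maximum sequence length
print(tokenizer.model_max_length)  # 512 for bert-base-cased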

The padding and truncation arguments can be combined to control the form of the output tensors:

python
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   102],
         [  101,  1045,  1005,  1049,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]])}
Special Tokens
python
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."

If you manually split the input text into tokens and then manually convert those tokens to ids, special tokens such as [CLS]/[SEP] are not added automatically. In BERT-family models these special tokens were part of pretraining, so you have to add them yourself.
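
If needed, the tokenizer can wrap manually built ids with the right special tokens; a minimal sketch using build_inputs_with_special_tokens (reusing ids from above):

python
# Add the checkpoint's special tokens around the manually converted ids
ids_with_special = tokenizer.build_inputs_with_special_tokens(ids)
print(tokenizer.decode(ids_with_special))  # now includes [CLS] ... [SEP]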

But when you call the tokenizer directly, as in tokenizer(sequence), it knows from the checkpoint which special tokens are expected and adds them automatically.

Encoding API

python
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)

First, it can tokenize a single sequence.

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)

It can also handle multiple sequences at a time, with no change to the API.

python
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

It can also pad according to several different targets: the longest sequence in the batch, the model's maximum length, or a specified maximum length.

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

It can also truncate sequences, and truncation can be combined with a maximum length.

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

Finally, the API can specify the format of the returned data: PyTorch tensors or NumPy arrays.
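
Putting it together, the tokenizer's output dict unpacks straight into the model's forward pass; a minimal sketch, assuming model and tokenizer were loaded from the same checkpoint:

python
# Tokenize with padding/truncation, then unpack the dict into the model
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output.logits)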
