[HuggingFace LLM] Data Flow in Classic NLP Tasks

Creating a Transformer

python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

The from_pretrained method downloads the model data from the Hugging Face Hub and caches it. As mentioned earlier, the checkpoint name corresponds to a specific architecture and set of weights; in this case it is a BERT model with the base architecture (12 layers, hidden size 768, 12 attention heads) and cased input (meaning the distinction between upper- and lowercase matters).
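
A quick way to double-check these numbers is to inspect the loaded config; the attributes below are standard fields of the BERT config.

python
# Architecture hyperparameters recorded in the checkpoint's config
print(model.config.num_hidden_layers)    # 12
print(model.config.hidden_size)          # 768
print(model.config.num_attention_heads)  # 12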

Loading and Saving

python
model.save_pretrained("directory_on_my_computer")

This saves config.json and model.safetensors to the local directory.

Where:

  • config.json holds all the attributes needed to build the model architecture;
  • model.safetensors is a state dictionary containing all of the model's weights;
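
The saved directory can be loaded back exactly like a Hub checkpoint. A minimal sketch, reusing the directory name from the example above:

python
from transformers import AutoModel

# from_pretrained also accepts a local directory that contains
# config.json and model.safetensors
model = AutoModel.from_pretrained("directory_on_my_computer")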

Text Encoding

To encode text, it is passed through a tokenizer, which produces the list of numbers corresponding to the sentence.

python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
>>> encoded_input
{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here input_ids are the numeric ids of the tokens, token_type_ids indicate which sentence each token comes from, and attention_mask indicates which tokens the model should attend to.
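
In this example token_type_ids are all 0 because there is only one sentence. They become informative when the tokenizer is called with two separate string arguments, which encodes them as a single sentence pair rather than a batch; a small sketch:

python
# A sentence pair: one flat sequence where token_type_ids switch
# from 0 (first sentence) to 1 (second sentence)
pair = tokenizer("How are you?", "I'm fine, thank you!")
print(pair["token_type_ids"])  # 0s followed by 1s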

A list of token ids can also be decoded back into the original sentence with tokenizer.decode:

python
>>> tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"

Tokens such as [CLS] and [SEP] are called special tokens. When a model used these special tokens during pretraining, the tokenizer must add them; models pretrained without them can be fed input that leaves them out.
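
The tokenizer call accepts an add_special_tokens argument that makes this visible; setting it to False skips [CLS]/[SEP]:

python
with_special = tokenizer("Hello!")
without_special = tokenizer("Hello!", add_special_tokens=False)

# The second list is the first one minus the [CLS]/[SEP] ids
print(with_special["input_ids"])
print(without_special["input_ids"])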

You can also encode a batch of several sentences at once by passing them as a list:

python
encoded_input = tokenizer(["How are you?", "I'm fine, thank you!"])
print(encoded_input)
{'input_ids': [[101, 1731, 1132, 1128, 136, 102], [101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
 
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], return_tensors="pt"
)
# This call raises an error: the two sequences have different numbers of
# tokens, so they cannot be stacked into a single rectangular tensor

Requesting return_tensors="pt" fails here because the sentences contain different numbers of tokens; the padding method is needed to bring them up to the same length first.

Padding
python
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   136,   102,     0,     0,     0,     0],
         [  101,  1045,  1005,  1049,  2503,   117,  5763,  1128,   136,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

The shorter sentence is padded with 0s, and the attention_mask marks those positions with 0 to indicate that they are padding and carry no real meaning.
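
The fill value is not arbitrary: it is the tokenizer's pad token, whose id happens to be 0 in BERT vocabularies.

python
print(tokenizer.pad_token)     # '[PAD]'
print(tokenizer.pad_token_id)  # 0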

When you build a tensor by hand from several input_ids, you can do the padding with tokenizer.pad_token_id:

python
import torch
from transformers import AutoModelForSequenceClassification

# The sequence-classification checkpoint whose logits are shown below
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

Notice that the sequence containing a padding token produces a different result when run as part of a batch. This is because the attention layers take all tokens into account, so the attention_mask is needed to tell them to ignore the padding tokens:

python
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)

With the attention_mask added, the batch gives the same result as feeding each sentence on its own.
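
As a sanity check, the second row of the batched logits can be compared numerically against the unbatched result; a small sketch (the tolerance absorbs floating-point noise):

python
single = model(torch.tensor(sequence2_ids)).logits

# Row 1 of the masked batch should match the standalone forward pass
print(torch.allclose(single, outputs.logits[1:], atol=1e-4))  # True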

Truncating

When a sequence is too long and exceeds the maximum sequence length used during pretraining, the model cannot process it, so the input has to be truncated.

python
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)

Setting truncation=True cuts inputs down to the model's maximum size.
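
When no explicit max_length is given, the truncation target is the maximum length associated with the checkpoint, which the tokenizer exposes:

python
# Maximum sequence length of the checkpoint (512 for BERT models)
print(tokenizer.model_max_length)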

The padding and truncation arguments can be combined to control the shape of the output tensors:

python
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   102],
         [  101,  1045,  1005,  1049,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]])}
Special Tokens
python
# The outputs below come from the uncased sentiment checkpoint loaded
# above, which is why the decoded text is lowercase
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."

If you manually split the input text into tokens and then manually convert those tokens to ids, the special tokens such as [CLS]/[SEP] are not added. In BERT-family models these special tokens were used during pretraining, so on this manual path you have to add them yourself.

If you call the tokenizer directly instead, as in tokenizer(sequence), it knows which special tokens the checkpoint expects and adds them automatically.
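
If you do stay on the manual path, the tokenizer also exposes the hook that wraps ids in the checkpoint's special tokens, so the two routes can be reconciled; a minimal sketch:

python
# Wrap the manually converted ids with [CLS]/[SEP]
ids_with_special = tokenizer.build_inputs_with_special_tokens(ids)
print(ids_with_special == model_inputs["input_ids"])  # True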

The Encoding API

python
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)

First, it can tokenize a single sequence.

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)

It can also handle several sequences at once, with no change to the API.

python
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

As the comments show, padding can target the longest sequence in the batch, the model's maximum length, or a specified maximum length.

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

Sequences can also be truncated, and truncation can be combined with a maximum length.

python
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

Finally, the API can specify the format of the returned arrays: PyTorch tensors ("pt") or NumPy arrays ("np").
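
Putting everything together, the complete flow from raw text to model outputs looks like this (reusing the sentiment-analysis checkpoint from the padding example above):

python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Tokenize, pad, truncate and return PyTorch tensors in one call,
# then unpack the whole encoding into the model
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
print(output.logits)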
