【HuggingFace LLM】经典NLP Tasks数据流转

创建Transformer

python 复制代码
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

from_pretrained 方法从Hugging Face Hub下载并缓存模型数据。如前所述,检查点名称对应于特定的模型架构和权重,在本例中是具有基本架构(12 层、768 个隐藏大小、12 个注意力头)和大小写输入(意味着大写/小写区分很重要)的 BERT 模型。

加载与保存

python 复制代码
model.save_pretrained("directory_on_my_computer")

保存到本地config.jsonmodel.safetensor

其中:

  • config.json是构建模型架构所需的所有必要属性;
  • model.safetensor是一个状态字典,包含了模型的所有权重;

文本编码

文本编码后,通常经过分词器得到一段话对应的数字列表。

python 复制代码
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
>>>encoded_input

{'input_ids': [101, 8667, 117, 1000, 1045, 1005, 1049, 2235, 17662, 12172, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

其中input_ids是tokens对应的数字,token_type_ids是指tokens来自哪句话,attention_mask指模型应该关注哪个tokens

通常也可以通过tokenizer.decode解析一个token列表为原话

python 复制代码
>>>tokenizer.decode(encoded_input["input_ids"])
"[CLS] Hello, I'm a single sentence! [SEP]"

其中[cls][sep]这类被称作特殊tokens,当模型预训练时使用 了这些特殊tokens,那么分词器中就需要添加,否则模型也可以不输入他们。

也可以同时输入(batch)多个句子同时进行编码

python 复制代码
encoded_input = tokenizer("How are you?", "I'm fine, thank you!")
print(encoded_input)
{'input_ids': [[101, 1731, 1132, 1128, 136, 102], [101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
 
encoded_input = tokenizer("How are you?", "I'm fine, thank you!", return_tensors="pt")
print(encoded_input)
{'input_ids': tensor([[  101,  1731,  1132,  1128,   136,   102],
         [  101,  1045,  1005,  1049,  2503,   117,  5763,  1128,   136,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

当返回return="pt"时,由于不同句子之间的字数不同,需要用到padding方法补充至相同长度

Padding
python 复制代码
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   136,   102,     0,     0,     0,     0],
         [  101,  1045,  1005,  1049,  2503,   117,  5763,  1128,   136,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

在较短的句子处补0,并在attention_mask中备注0表示这部分是padding没有实际意义

如果是手动传入多个input_ids制作张量时,可以使用tokenizer.pad_token_id进行填充

python 复制代码
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)

可以发现,缺失一个填充值在批输入时有不同的结果 。这是由于模型中Attention部分会把所有tokens纳入考虑,因此需要使用Attention_mask告诉Attention层这些tokens不需要管

python 复制代码
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)

添加Attention_mask后,得到与直接输入相同的结果

Truncating

当张量过长,超过预训练时最大的序列长度时,模型会无法处理,需要及时截断。

python 复制代码
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)

通过truncation=True来截断输入的最大size。

可以通过结合填充(padding)和截断(truncating)参数来控制输出张量形式

python 复制代码
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   102],
         [  101,  1045,  1005,  1049,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]])}
Special Tokens
python 复制代码
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
"i've been waiting for a huggingface course my whole life."

如果手动 的将输入文本进行分词-> 手动将分词进行tokens转化。那么这种方法不会自动的加上 [cls]/[sep]等特殊的tokens,而在Bert等系列中,这些特殊tokens在预训练中已被加入,因此需要自己手动添加特殊tokens

但若使用tokenizer.tokenize()方法,模型自动识别 checkpoint是否需要特殊tokens自动添加

编码API

python 复制代码
sequence = "I've been waiting for a HuggingFace course my whole life."
model_inputs = tokenizer(sequence)

首先,它可以对单个序列进行分词化

python 复制代码
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
model_inputs = tokenizer(sequences)

它还能同时处理多个序列,且 API 不变

python 复制代码
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

还可以根据多个目标来填充

python 复制代码
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

还可以截断序列,并且可以和最大长度混用。

python 复制代码
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

API还可以指定返回的列表格式tensor还是numpy

相关推荐
audyxiao0012 分钟前
智能交通顶刊TITS论文分享|如何利用驾驶感知世界模型实现无信号灯路口自动驾驶?
人工智能·机器学习·自动驾驶·tits
lisw057 分钟前
氛围炒股概述!
大数据·人工智能·机器学习
hjs_deeplearning7 分钟前
文献阅读篇#16:自动驾驶中的视觉语言模型:综述与展望
人工智能·语言模型·自动驾驶
爱喝可乐的老王1 小时前
PyTorch深度学习参数初始化和正则化
人工智能·pytorch·深度学习
杭州泽沃电子科技有限公司4 小时前
为电气风险定价:如何利用监测数据评估工厂的“电气安全风险指数”?
人工智能·安全
Godspeed Zhao6 小时前
自动驾驶中的传感器技术24.3——Camera(18)
人工智能·机器学习·自动驾驶
顾北127 小时前
MCP协议实战|Spring AI + 高德地图工具集成教程
人工智能
wfeqhfxz25887827 小时前
毒蝇伞品种识别与分类_Centernet模型优化实战
人工智能·分类·数据挖掘
中杯可乐多加冰8 小时前
RAG 深度实践系列(七):从“能用”到“好用”——RAG 系统优化与效果评估
人工智能·大模型·llm·大语言模型·rag·检索增强生成
珠海西格电力科技8 小时前
微电网系统架构设计:并网/孤岛双模式运行与控制策略
网络·人工智能·物联网·系统架构·云计算·智慧城市