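NLP: Loading and Using Pretrained Models

1. Common pretrained models in NLP

2. Steps for loading and using a pretrained model

2.1 Choose the pretrained model to load and install the dependencies

2.2 Load the pretrained model's tokenizer

1. Common pretrained models in NLP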

BERT
GPT
GPT-2
Transformer-XL
XLNet
XLM
RoBERTa
DistilBERT
ALBERT
T5
XLM-RoBERTa

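🐵🐵 All of the pretrained models above, and their variants, are built on the Transformer. They differ only in architectural details such as how the neurons are connected, the number of encoder hidden layers, and the number of attention heads, and most of these choices were made based on performance on standard benchmark datasets. As users, we therefore do not need to study the theoretical merits of each design in depth. It is enough to try as many of the available models as is practical on our own target data and keep the one that performs best.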
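The advice above amounts to a simple selection loop. Here is a minimal sketch, assuming a hypothetical `evaluate` callable that you supply yourself: it should load the named model, run it on your own validation data, and return one score where higher is better. The candidate names and scores below are placeholders, not real benchmark results.

```python
from typing import Callable, Dict, Iterable, Tuple


def pick_best_model(
    candidates: Iterable[str],
    evaluate: Callable[[str], float],
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate and return the best name plus all scores.

    `evaluate` is a hypothetical user-supplied callable: it should load the
    named model, run it on the target validation set, and return a single
    score where higher is better.
    """
    scores = {name: evaluate(name) for name in candidates}
    if not scores:
        raise ValueError("no candidate models were given")
    best = max(scores, key=scores.get)
    return best, scores


# Stand-in scores for illustration only; these are NOT real results.
fake_scores = {"model-a": 0.81, "model-b": 0.86, "model-c": 0.79}
best_name, all_scores = pick_best_model(fake_scores, fake_scores.get)
print(best_name)
```

In practice `evaluate` would wrap the loading and inference steps described in the rest of this article, plus whatever metric suits the task.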

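2. Steps for loading and using a pretrained model

Step 1: Choose the pretrained model to load and install the dependencies.
Step 2: Load the pretrained model's tokenizer.
Step 3: Load the pretrained model, with or without a head.
Step 4: Run the model to get the output.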

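2.1 Choose the pretrained model to load and install the dependencies

Install the required dependencies before loading a model. The code in this article also imports torch and transformers, so those two packages must be installed as well.

pip install tqdm boto3 requests regex sentencepiece sacremoses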

2.2 Load the pretrained model's tokenizer

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForQuestionAnswering

# Mirror URL passed to from_pretrained; whether the `mirror` argument
# is honored depends on the installed transformers version.
mirror='https://mirrors.tuna.tsinghua.edu.cn/help/hugging-face-models/'

def demo24_1_load_tokenizer():
    # Load the tokenizer that matches the bert-base-chinese checkpoint
    tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", mirror=mirror)
    print("tokenizer--->", tokenizer)

demo24_1_load_tokenizer()
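To make concrete what the tokenizer does, here is a toy sketch of the text-to-ID mapping. The IDs for the seven characters and the two delimiters are taken from the tokenizer output printed in section 3.1; the `[UNK]` ID of 100 and the helper name `toy_encode` are assumptions made for illustration. The real tokenizer does more than this (WordPiece splitting over the full vocabulary), so treat it only as a mental model.

```python
# Toy vocabulary: a tiny excerpt standing in for the real vocabulary file.
TOY_VOCAB = {
    "[UNK]": 100,  # assumed ID for unknown tokens
    "[CLS]": 101,  # start marker
    "[SEP]": 102,  # end marker
    "人": 782, "生": 4495, "该": 6421, "如": 1963,
    "何": 862, "起": 6629, "头": 1928,
}


def toy_encode(text: str) -> list:
    """Map each character to its ID and wrap the result in [CLS] ... [SEP]."""
    unk = TOY_VOCAB["[UNK]"]
    body = [TOY_VOCAB.get(ch, unk) for ch in text]
    return [TOY_VOCAB["[CLS]"]] + body + [TOY_VOCAB["[SEP]"]]


ids = toy_encode("人生该如何起头")
print(ids)
```

Characters missing from the toy table fall back to the `[UNK]` ID, which mirrors how a fixed vocabulary handles unseen symbols.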

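2.3 Load the pretrained model, with or without a head

The 'head' here is the model's task-specific output layer. Loading the model without a head means using it only to produce a feature representation of the input text.
When loading a model with a head, this article uses three types of head: AutoModelForMaskedLM (language modeling head), AutoModelForSequenceClassification (classification head), and AutoModelForQuestionAnswering (question answering head).
Each type of head makes the pretrained model output a tensor of a specific shape. For example, with the classification head, a single input sentence and the default two labels give an output of shape (1, 2), which is used to decide the classification result.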
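The effect of each head on the output shape can be sketched with plain PyTorch. This is a simplified illustration, not the library's actual head implementations: each head is reduced to a single linear layer, and a random tensor stands in for the headless model's output. The sizes 768 (hidden size) and 21128 (vocabulary size) match the bert-base-chinese outputs shown later in this article.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, vocab_size, num_labels = 768, 21128, 2

# Stand-in for the headless model's output: (batch, seq_len, hidden_size).
hidden_states = torch.randn(1, 9, hidden_size)

# Simplified heads: one linear layer each (real heads have more parts).
lm_head = nn.Linear(hidden_size, vocab_size)   # one score per vocabulary entry
cls_head = nn.Linear(hidden_size, num_labels)  # one score per class
qa_head = nn.Linear(hidden_size, 2)            # a start score and an end score

with torch.no_grad():
    lm_logits = lm_head(hidden_states)          # (1, 9, 21128)
    cls_logits = cls_head(hidden_states[:, 0])  # first position only -> (1, 2)
    # Split the last dimension into two (1, 9) tensors.
    start_logits, end_logits = qa_head(hidden_states).unbind(dim=-1)

print(lm_logits.shape, cls_logits.shape, start_logits.shape, end_logits.shape)
```

The classification head collapses the sequence to one vector before scoring, which is why its output has no sequence dimension, while the language modeling and question answering heads keep one set of scores per position.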

3. Getting outputs from the different models

3.1 Output of the model without a head

def demo24_3_load_AutoModel():

    # 1 Load the tokenizer
    # Name of the pretrained model to load
    model_name = 'bert-base-chinese'

    tokenizer = AutoTokenizer.from_pretrained(model_name, mirror=mirror)

    # 2 Load the model (no head)
    model = AutoModel.from_pretrained(model_name)

    # 3 Convert the text to numbers with the tokenizer
    # Chinese input text
    input_text = "人生该如何起头"

    # Map the text to token IDs
    indexed_tokens = tokenizer.encode(input_text)

    # Print the token IDs
    print("indexed_tokens:", indexed_tokens)

    # Add a batch dimension and convert the IDs to a tensor for the model
    tokens_tensor = torch.tensor([indexed_tokens])

    # 4 Run the headless pretrained model
    with torch.no_grad():
        # return_dict=False makes the model return a plain tuple:
        # (last hidden states, pooled output)
        encoded_layers, _ = model(tokens_tensor, return_dict=False)

    print("Output of the model without a head:", encoded_layers)
    print("Shape of the output of the model without a head:", encoded_layers.shape)

demo24_3_load_AutoModel()

Output

# Result of the tokenizer mapping: 101 and 102 are the start and end markers ([CLS] and [SEP]),
# and each number in between corresponds to one character of "人生该如何起头".
indexed_tokens: [101, 782, 4495, 6421, 1963, 862, 6629, 1928, 102]


Output of the model without a head: tensor([[[ 0.5421,  0.4526, -0.0179,  ...,  1.0447, -0.1140,  0.0068],
         [-0.1343,  0.2785,  0.1602,  ..., -0.0345, -0.1646, -0.2186],
         [ 0.9960, -0.5121, -0.6229,  ...,  1.4173,  0.5533, -0.2681],
         ...,
         [ 0.0115,  0.2150, -0.0163,  ...,  0.6445,  0.2452, -0.3749],
         [ 0.8649,  0.4337, -0.1867,  ...,  0.7397, -0.2636,  0.2144],
         [-0.6207,  0.1668,  0.1561,  ...,  1.1218, -0.0985, -0.0937]]])


# The output shape is 1x9x768: each of the 9 tokens (the 7 characters plus the two markers)
# is represented by a 768-dimensional vector.
# We can build on this encoding for further custom processing, e.g. writing our own fine-tuning network to produce the final output.
Shape of the output of the model without a head: torch.Size([1, 9, 768])
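As one possible shape for such a custom fine-tuning network, here is a minimal sketch of a classifier that sits on top of the 1x9x768 encoding. The class name `SimpleClassifierHead`, the mean pooling, the dropout rate, and the two output classes are all illustrative assumptions, and a random tensor stands in for the real model output.

```python
import torch
from torch import nn


class SimpleClassifierHead(nn.Module):
    """Hypothetical fine-tuning head: mean-pool the tokens, then classify."""

    def __init__(self, hidden_size: int = 768, num_classes: int = 2, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, encoded_layers: torch.Tensor) -> torch.Tensor:
        # encoded_layers: (batch, seq_len, hidden_size)
        pooled = encoded_layers.mean(dim=1)   # (batch, hidden_size)
        return self.fc(self.dropout(pooled))  # (batch, num_classes)


head = SimpleClassifierHead()
head.eval()  # disable dropout for a deterministic forward pass

# Random stand-in with the same shape as the headless model's output.
fake_encoded_layers = torch.randn(1, 9, 768)
with torch.no_grad():
    logits = head(fake_encoded_layers)
print(logits.shape)
```

During fine-tuning, `encoded_layers` from the pretrained model would replace the random tensor, and the head would be trained (with or without the encoder) against task labels.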

3.2 Output of the model with a language modeling head

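def demo24_4_load_AutoLM():

    # 1 Load the tokenizer
    # Name of the pretrained model to load
    model_name = 'bert-base-chinese'

    tokenizer = AutoTokenizer.from_pretrained(model_name, mirror=mirror)

    # 2 Load the model with a language modeling head
    lm_model = AutoModelForMaskedLM.from_pretrained(model_name)

    # 3 Convert the text to numbers with the tokenizer
    # Chinese input text
    input_text = "人生该如何起头"

    # Map the text to token IDs
    indexed_tokens = tokenizer.encode(input_text)

    # Print the token IDs
    print("indexed_tokens:", indexed_tokens)

    # Add a batch dimension and convert the IDs to a tensor for the model with the language modeling head
    tokens_tensor = torch.tensor([indexed_tokens])

    # 4 Run the pretrained model with the language modeling head
    with torch.no_grad():
        # return_dict=False makes the model return a plain tuple; element 0 holds the logits
        lm_output = lm_model(tokens_tensor, return_dict=False)

    print("Output of the model with a language modeling head:", lm_output)
    print("Shape of the output of the model with a language modeling head:", lm_output[0].shape)

demo24_4_load_AutoLM()

Output

Output of the model with a language modeling head: (tensor([[[ -7.9706,  -7.9119,  -7.9317,  ...,  -7.2174,  -7.0263,  -7.3746],
         [ -8.2097,  -8.1810,  -8.0645,  ...,  -7.2349,  -6.9283,  -6.9856],
         [-13.7458, -13.5978, -12.6076,  ...,  -7.6817,  -9.5642, -11.9928],
         ...,
         [ -9.0928,  -8.6857,  -8.4648,  ...,  -8.2368,  -7.5684, -10.2419],
         [ -8.9458,  -8.5784,  -8.6325,  ...,  -7.0547,  -5.3288,  -7.8077],
         [ -8.4154,  -8.5217,  -8.5379,  ...,  -6.7102,  -5.9782,  -7.6909]]]),)

# The output shape is 1x9x21128: for each of the 9 positions, the head produces one score (logit)
# for each of the 21128 entries in the vocabulary.
# These scores are typically used to predict which token belongs at a position, for example a masked one;
# as with the headless model, they can also feed further custom processing.
Shape of the output of the model with a language modeling head: torch.Size([1, 9, 21128])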
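One common way to read a 1x9x21128 output is to take, at each position, the vocabulary entry with the highest score. The sketch below shows that step on a random stand-in tensor, so the numbers are not real model predictions; a large value is planted at one position purely so the example has a known answer. With the real model, `tokenizer.convert_ids_to_tokens` would turn the resulting IDs back into characters.

```python
import torch

vocab_size = 21128
seq_len = 9

# Random stand-in for lm_output[0]; shape (batch, seq_len, vocab_size).
torch.manual_seed(0)
logits = torch.randn(1, seq_len, vocab_size)

# Plant a clear winner at position 4 so the example has a known answer.
# ID 1963 is the character at that position in the tokenizer output above.
logits[0, 4, 1963] = 50.0

# Most likely token ID at every position: shape (1, 9).
predicted_ids = logits.argmax(dim=-1)

# Top-3 candidates at one position, with probabilities from a softmax.
probs = logits[0, 4].softmax(dim=-1)
top_probs, top_ids = probs.topk(3)

print(predicted_ids.shape, predicted_ids[0, 4].item(), top_ids.tolist())
```

For masked-token prediction, the position of interest would be wherever the input contained the `[MASK]` token, and only the scores at that position matter.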

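🐵🐵 The code follows the same basic pattern in each case; only the model class being loaded differs. The other two models with a 'head' are loaded through AutoModelForSequenceClassification and AutoModelForQuestionAnswering.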
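A minimal sketch of those two classes follows. To keep it runnable without downloading weights, it builds a tiny, randomly initialised BERT from a `BertConfig` via `from_config`; that choice and the small layer sizes are assumptions for illustration, so the scores are meaningless and only the output shapes carry over. With real weights you would call `from_pretrained('bert-base-chinese')` as in the earlier sections, and a real question answering input would encode a question together with a context passage.

```python
import torch
from transformers import (
    AutoModelForQuestionAnswering,
    AutoModelForSequenceClassification,
    BertConfig,
)

# Tiny, randomly initialised BERT so the sketch runs without downloading weights.
config = BertConfig(
    vocab_size=21128,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

# Token IDs for "人生该如何起头", as printed by the tokenizer in section 3.1.
tokens_tensor = torch.tensor([[101, 782, 4495, 6421, 1963, 862, 6629, 1928, 102]])

cls_model = AutoModelForSequenceClassification.from_config(config).eval()
qa_model = AutoModelForQuestionAnswering.from_config(config).eval()

with torch.no_grad():
    cls_logits = cls_model(tokens_tensor).logits  # (1, 2): one score per class
    qa_output = qa_model(tokens_tensor)           # start/end scores per position

print(cls_logits.shape, qa_output.start_logits.shape, qa_output.end_logits.shape)
```

The classification head returns one row of class scores per input, while the question answering head returns a start score and an end score for every token position.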