Self-supervised learning is a learning paradigm that requires no manually annotated data: by designing suitable pre-training tasks, a model learns features from large amounts of unlabeled data. It does so by constructing pseudo-labels, so the model can be pre-trained without supervision and then fine-tuned on a specific downstream task. A common example of self-supervised learning is the masked language model (Masked Language Model, MLM), as used in BERT.
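To make the pseudo-label idea concrete, the sketch below randomly masks tokens in a sentence and keeps the original tokens as the prediction targets, which is how the MLM objective builds labels from unlabeled text (a minimal sketch using Hugging Face's DataCollatorForLanguageModeling; the sentence and masking probability are arbitrary choices for illustration):

from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Randomly replaces ~15% of tokens with [MASK] (or a random token) and keeps
# the original token ids as labels, so the pseudo-labels come from the text itself.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("Self-supervised learning builds labels from raw text.", return_tensors='pt')
batch = collator([{k: v[0] for k, v in encoding.items()}])

print(batch['input_ids'])  # input with some positions masked
print(batch['labels'])     # original ids at masked positions, -100 elsewhere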
Representative Models and Techniques
BERT (Bidirectional Encoder Representations from Transformers):
Overview: BERT is pre-trained with the masked language modeling objective described above; the example below uses a pre-trained checkpoint to fill in a [MASK] token.
from transformers import BertTokenizer, BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

input_text = "The capital of France is [MASK]."
inputs = tokenizer(input_text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Decode only the prediction at the [MASK] position, not the whole sequence.
mask_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_id)
print(predicted_token)
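The introduction notes that a self-supervised model is pre-trained first and then fine-tuned on a specific task; the sketch below outlines that second step for sentence classification (a minimal sketch: the example sentences, label count, and learning rate are placeholders, not a recommended recipe):

from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Reuse the pre-trained encoder; only the classification head is newly initialized.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Placeholder labeled examples standing in for a real downstream dataset.
texts = ["I loved this movie.", "This was a terrible film."]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one fine-tuning step
optimizer.step()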
T5 (Text-To-Text Transfer Transformer):
Overview: T5 casts every NLP task into a text-to-text format and is pre-trained with multi-task learning, as the examples below show.
Code example:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# The task is specified by a text prefix; input and output are both plain text.
input_text = "translate English to French: The capital of France is Paris."
inputs = tokenizer(input_text, return_tensors='pt')
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)
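To show what the text-to-text unification means in practice, the same checkpoint can be switched between tasks simply by changing the task prefix (a minimal sketch; with 't5-small' the outputs are illustrative rather than high quality):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Different tasks, one model, one text-in/text-out interface.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: The Eiffel Tower is a wrought-iron lattice tower in Paris, "
    "built from 1887 to 1889 and named after the engineer Gustave Eiffel.",
    "cola sentence: The books is on the table.",  # grammatical acceptability
]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(**inputs, max_length=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))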
The same kind of task can also be handled by an OpenAI GPT model through the API; the legacy completions interface is shown here:
import openai

openai.api_key = 'YOUR_API_KEY'

prompt = """
Translate the following English text to French: "The capital of France is Paris."
"""

# Legacy completions endpoint (openai<1.0 SDK); text-davinci-003 has since been deprecated.
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=60
)
print(response.choices[0].text.strip())
T0 (T5-based Zero-Shot Learner):
Overview: T0 is built on the T5 architecture and fine-tuned on a large collection of NLP tasks rewritten as natural-language prompts (multitask prompted training), which gives it strong zero-shot generalization to tasks it has never seen, as the examples below illustrate.
Zero-shot learning example:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load an actual T0 checkpoint (T5-based) rather than a vanilla T5 model,
# so the example reflects T0's zero-shot behaviour; T0_3B needs several GB of memory.
tokenizer = AutoTokenizer.from_pretrained('bigscience/T0_3B')
model = AutoModelForSeq2SeqLM.from_pretrained('bigscience/T0_3B')

input_text = "Translate English to German: The weather is nice today."
inputs = tokenizer(input_text, return_tensors='pt')
outputs = model.generate(**inputs)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)
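Translation is only one case; the point of T0 is that it can follow natural-language task descriptions it was never explicitly trained on. A minimal sketch of such a prompted task follows (the prompt wording here is an assumption, not an official T0 template):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('bigscience/T0_3B')
model = AutoModelForSeq2SeqLM.from_pretrained('bigscience/T0_3B')

# A plain-language instruction instead of a fixed task prefix.
prompt = ("Is this review positive or negative? "
          "Review: The plot was predictable and the acting was flat.")
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))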
CLIP (Contrastive Language-Image Pre-Training):
Overview: CLIP is pre-trained by contrasting images with their text descriptions, which enables strong zero-shot and few-shot performance on recognition tasks; a sketch of the contrastive objective follows the example below.
Zero-shot image classification example:
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and a set of candidate text labels, then compare them.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    logits_per_image, logits_per_text = model(image, text)

probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print("Label probs:", probs)