Contents

- Introduction
- NLP Fundamentals
- Traditional NLP Methods
  - Bag of Words
  - TF-IDF (Term Frequency-Inverse Document Frequency)
- Word Embeddings: The Evolution of Word Vectors
- The Transformer Architecture in Detail
- Hands-On Project: A Text Classification Model
- Pretrained Language Models
- Hands-On Project: A Sentiment Analysis System
- The Future of NLP
  - 1. Multimodal Learning
  - 2. Few-Shot and Zero-Shot Learning
- Summary
- Further Learning Suggestions
## Introduction

Natural Language Processing (NLP) is the branch of artificial intelligence focused on the interaction between computers and human language. With the development of deep learning, NLP has made breakthrough progress, particularly since the Transformer architecture was introduced in 2017. This article takes a close look at the core concepts of NLP, how its techniques have evolved, and the revolutionary impact of the Transformer architecture.

## NLP Fundamentals

### What Is Natural Language Processing?

NLP is the technology that enables computers to understand, interpret, and generate human language. It has two main aspects:

- Natural Language Understanding (NLU): enabling computers to understand the meaning of text
- Natural Language Generation (NLG): enabling computers to generate human-like text

### Core NLP Tasks

NLP covers a wide range of tasks (a short library-based sketch follows this list):

- Text classification: assigning text to predefined categories
- Named entity recognition (NER): identifying entities mentioned in text
- Relation extraction: identifying relationships between entities
- Sentiment analysis: determining the sentiment expressed in a text
- Machine translation: translating text from one language into another
- Question answering: providing answers to questions
- Text summarization: generating a short summary of a text
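As a quick taste of how a couple of these tasks look in practice, the sketch below uses the Hugging Face `transformers` pipeline API. It assumes the `transformers` package and a backend such as PyTorch are installed and that default pretrained checkpoints can be downloaded; the rest of this article instead builds much simpler versions from scratch to show how the pieces work internally.

```python
from transformers import pipeline

# Sentiment analysis with a default pretrained checkpoint
sentiment = pipeline("sentiment-analysis")
print(sentiment("I love this movie"))  # e.g. [{'label': 'POSITIVE', 'score': ...}]

# Named entity recognition
ner = pipeline("ner")
print(ner("Google was founded in California"))
```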
## Traditional NLP Methods

### Bag of Words

The bag-of-words model is the simplest way to represent text; it discards all word-order information:
```python
import numpy as np
from collections import Counter

class BagOfWords:
    def __init__(self):
        self.vocabulary = {}
        self.vocabulary_size = 0

    def fit(self, documents):
        """Build the vocabulary."""
        word_counts = Counter()
        for doc in documents:
            words = doc.lower().split()
            word_counts.update(words)
        # Keep only the most common words
        most_common = word_counts.most_common(5000)  # keep the 5000 most frequent words
        self.vocabulary = {word: idx for idx, (word, _) in enumerate(most_common)}
        self.vocabulary_size = len(self.vocabulary)

    def transform(self, documents):
        """Convert documents into bag-of-words vectors."""
        vectors = []
        for doc in documents:
            words = doc.lower().split()
            vector = np.zeros(self.vocabulary_size)
            word_count = Counter(words)
            for word, count in word_count.items():
                if word in self.vocabulary:
                    idx = self.vocabulary[word]
                    vector[idx] = count
            vectors.append(vector)
        return np.array(vectors)

# Example usage
documents = [
    "I love machine learning",
    "Machine learning is fascinating",
    "I enjoy deep learning",
    "Deep learning is a subset of machine learning"
]

bow = BagOfWords()
bow.fit(documents)
vectors = bow.transform(documents)
print("Vocabulary size:", bow.vocabulary_size)
print("Document vector shape:", vectors.shape)
```
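In real projects this is usually not hand-rolled; scikit-learn's `CountVectorizer` implements the same idea. A minimal sketch, assuming scikit-learn is installed and reusing the `documents` list above:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=5000, lowercase=True)
X = vectorizer.fit_transform(documents)  # sparse matrix of shape (n_documents, vocabulary_size)
print(X.shape)
```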
### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF improves on the bag-of-words model by weighting each word by how important it is to a document.
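Concretely, the sketch below scores a term t in a document d as

tf-idf(t, d) = tf(t, d) × log(N / (1 + df(t)))

where tf(t, d) is the term's relative frequency inside the document, N is the number of documents, and df(t) is the number of documents containing t (the +1 avoids division by zero). The from-scratch implementation of exactly this weighting: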
```python
import math

class TFIDF:
    def __init__(self):
        self.vocabulary = {}
        self.idf = {}
        self.vocabulary_size = 0
        self.document_count = 0

    def fit(self, documents):
        """Compute IDF values."""
        self.document_count = len(documents)
        word_document_counts = {}
        all_words = set()
        for doc in documents:
            words = set(doc.lower().split())
            all_words.update(words)
            for word in words:
                word_document_counts[word] = word_document_counts.get(word, 0) + 1
        # Build the vocabulary
        self.vocabulary = {word: idx for idx, word in enumerate(all_words)}
        self.vocabulary_size = len(self.vocabulary)
        # Compute IDF
        for word, doc_count in word_document_counts.items():
            self.idf[word] = math.log(self.document_count / (1 + doc_count))

    def transform(self, documents):
        """Convert documents into TF-IDF vectors."""
        vectors = []
        for doc in documents:
            words = doc.lower().split()
            word_count = Counter(words)
            total_words = len(words)
            vector = np.zeros(self.vocabulary_size)
            for word, count in word_count.items():
                if word in self.vocabulary:
                    # Term frequency
                    tf = count / total_words
                    # TF-IDF weight
                    idx = self.vocabulary[word]
                    vector[idx] = tf * self.idf.get(word, 0)
            vectors.append(vector)
        return np.array(vectors)

# Example usage (reuses numpy and Counter imported above)
tfidf = TFIDF()
tfidf.fit(documents)
tfidf_vectors = tfidf.transform(documents)
print("TF-IDF vector shape:", tfidf_vectors.shape)
```
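Again, in practice scikit-learn's `TfidfVectorizer` would normally be used instead (it applies a smoothed IDF and L2 normalization by default, so the numbers differ slightly from the sketch above). A minimal sketch, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)
print(X_tfidf.shape)
```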
## Word Embeddings: The Evolution of Word Vectors

### Word2Vec

Word2Vec learns distributed word representations from the contexts in which words occur:
```python
import numpy as np
from collections import defaultdict

class Word2Vec:
    def __init__(self, vector_size=100, window=5, learning_rate=0.025, epochs=100, min_count=1):
        self.vector_size = vector_size
        self.window = window
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.min_count = min_count
        self.word_vectors = {}
        self.vocab = set()

    def build_vocab(self, sentences):
        """Build the vocabulary."""
        word_counts = defaultdict(int)
        for sentence in sentences:
            for word in sentence.split():
                word_counts[word.lower()] += 1
        # Filter out rare words (min_count=1 keeps everything, which suits the tiny corpus below)
        self.vocab = {word for word, count in word_counts.items() if count >= self.min_count}
        # Initialize word vectors randomly
        for word in self.vocab:
            self.word_vectors[word] = np.random.uniform(-0.5, 0.5, self.vector_size)

    def sigmoid(self, x):
        """Numerically stable sigmoid."""
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def train(self, sentences):
        """Train a heavily simplified skip-gram model.

        Note: this toy version only uses positive (target, context) pairs;
        a real implementation adds negative sampling or hierarchical softmax.
        """
        for epoch in range(self.epochs):
            total_loss = 0
            for sentence in sentences:
                words = [word.lower() for word in sentence.split() if word.lower() in self.vocab]
                for i, target_word in enumerate(words):
                    # Collect context words within the window
                    start = max(0, i - self.window)
                    end = min(len(words), i + self.window + 1)
                    context_words = [words[j] for j in range(start, end) if j != i]
                    # Update vectors for each (target, context) pair
                    for context_word in context_words:
                        target_vector = self.word_vectors[target_word].copy()
                        context_vector = self.word_vectors[context_word].copy()
                        # Similarity and loss
                        dot_product = np.dot(target_vector, context_vector)
                        probability = self.sigmoid(dot_product)
                        loss = -np.log(probability + 1e-10)
                        # Gradient step
                        gradient = probability - 1
                        self.word_vectors[target_word] -= self.learning_rate * gradient * context_vector
                        self.word_vectors[context_word] -= self.learning_rate * gradient * target_vector
                        total_loss += loss
            if epoch % 10 == 0:
                print(f"Epoch {epoch}, Average Loss: {total_loss / len(sentences):.4f}")

    def get_vector(self, word):
        """Return the vector for a word, or None if it is out of vocabulary."""
        return self.word_vectors.get(word.lower(), None)

    def similarity(self, word1, word2):
        """Cosine similarity between two words."""
        vec1 = self.get_vector(word1)
        vec2 = self.get_vector(word2)
        if vec1 is None or vec2 is None:
            return 0
        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)
        return dot_product / (norm1 * norm2)

# Toy training data
sentences = [
    "the cat sits on the mat",
    "the dog plays in the garden",
    "cats and dogs are pets",
    "the cat is sleeping",
    "dogs love to play",
    "mat is comfortable",
    "garden is beautiful"
]

# Train Word2Vec
w2v = Word2Vec(vector_size=50, epochs=100)
w2v.build_vocab(sentences)
w2v.train(sentences)

# Check word similarities
print("Similarity of 'cat' and 'dog':", w2v.similarity('cat', 'dog'))
print("Similarity of 'cat' and 'mat':", w2v.similarity('cat', 'mat'))
```
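For anything beyond a demo, a battle-tested implementation such as gensim's `Word2Vec` (which includes negative sampling and many optimizations) is the usual choice. A minimal sketch, assuming gensim with its 4.x API is installed and reusing the `sentences` list above:

```python
from gensim.models import Word2Vec as GensimWord2Vec

tokenized = [sentence.split() for sentence in sentences]
model = GensimWord2Vec(sentences=tokenized, vector_size=50, window=5, min_count=1, sg=1, epochs=100)
print(model.wv.similarity("cat", "dog"))
```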
## The Transformer Architecture in Detail

### Why the Transformer Was Revolutionary

In 2017, Google introduced the Transformer architecture in the paper "Attention Is All You Need", and it fundamentally changed the NLP landscape. The Transformer's core innovation is the self-attention mechanism: the model relies entirely on attention and does away with the traditional recurrent structure.

### Self-Attention

Self-attention allows the model to attend to every other position in the sequence while processing each position.
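The computation implemented below is scaled dot-product attention:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

where Q, K, and V are the query, key, and value matrices and d_k is the dimensionality of each key (here, the per-head dimension). The NumPy sketch below implements this for multiple heads: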
```python
import numpy as np

class SelfAttention:
    def __init__(self, embed_dim, num_heads=8):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Initialize projection matrices
        self.W_q = np.random.randn(embed_dim, embed_dim) * 0.01
        self.W_k = np.random.randn(embed_dim, embed_dim) * 0.01
        self.W_v = np.random.randn(embed_dim, embed_dim) * 0.01
        self.W_o = np.random.randn(embed_dim, embed_dim) * 0.01

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """Scaled dot-product attention."""
        # Attention scores: (batch, heads, seq_len, seq_len)
        scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(self.head_dim)
        # Apply the mask (if provided)
        if mask is not None:
            scores += mask * -1e9
        # Attention weights
        attention_weights = self.softmax(scores, axis=-1)
        # Weighted sum of the value vectors
        output = np.matmul(attention_weights, V)
        return output, attention_weights

    def softmax(self, x, axis=-1):
        """Numerically stable softmax."""
        x_max = np.max(x, axis=axis, keepdims=True)
        exp_x = np.exp(x - x_max)
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

    def forward(self, x, mask=None):
        """Forward pass."""
        batch_size, seq_len, embed_dim = x.shape
        # Project to Q, K, V
        Q = np.matmul(x, self.W_q)
        K = np.matmul(x, self.W_k)
        V = np.matmul(x, self.W_v)
        # Split into multiple heads
        Q = Q.reshape(batch_size, seq_len, self.num_heads, self.head_dim)
        K = K.reshape(batch_size, seq_len, self.num_heads, self.head_dim)
        V = V.reshape(batch_size, seq_len, self.num_heads, self.head_dim)
        # Move the head dimension forward so heads are processed in parallel
        Q = Q.transpose(0, 2, 1, 3)
        K = K.transpose(0, 2, 1, 3)
        V = V.transpose(0, 2, 1, 3)
        # Attention
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Merge the heads back together
        attention_output = attention_output.transpose(0, 2, 1, 3)
        attention_output = attention_output.reshape(batch_size, seq_len, embed_dim)
        # Output projection
        output = np.matmul(attention_output, self.W_o)
        return output, attention_weights

# Example usage
embed_dim = 512
num_heads = 8
batch_size = 2
seq_len = 10

# Input of shape (batch_size, seq_len, embed_dim)
x = np.random.randn(batch_size, seq_len, embed_dim)

# Self-attention layer
attention = SelfAttention(embed_dim, num_heads)

# Forward pass
output, weights = attention.forward(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
print("Attention weights shape:", weights.shape)
```
### Positional Encoding

Because the Transformer contains no recurrence, positional encodings are needed to supply information about token positions.
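The sinusoidal scheme implemented below follows the original paper:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

so each embedding dimension corresponds to a sinusoid of a different wavelength: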
```python
class PositionalEncoding:
    def __init__(self, embed_dim, max_seq_len=5000):
        self.embed_dim = embed_dim
        self.max_seq_len = max_seq_len
        # Precompute the positional encoding matrix
        position = np.arange(max_seq_len)[:, np.newaxis]
        div_term = np.exp(np.arange(0, embed_dim, 2) * -(np.log(10000.0) / embed_dim))
        pe = np.zeros((max_seq_len, embed_dim))
        pe[:, 0::2] = np.sin(position * div_term)
        pe[:, 1::2] = np.cos(position * div_term)
        self.pe = pe

    def forward(self, x):
        """Add positional encodings to the input embeddings."""
        seq_len = x.shape[1]
        return x + self.pe[:seq_len]

# Example usage
embed_dim = 512
max_seq_len = 100
batch_size = 4
seq_len = 50

# Input embeddings
x = np.random.randn(batch_size, seq_len, embed_dim)

# Add positional encodings
pos_encoding = PositionalEncoding(embed_dim, max_seq_len)
x_with_pos = pos_encoding.forward(x)
print("Original input shape:", x.shape)
print("Shape after adding positional encoding:", x_with_pos.shape)
```
### A Complete Transformer Encoder Layer

```python
class TransformerEncoderLayer:
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate=0.1):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.ff_dim = ff_dim
        self.dropout_rate = dropout_rate
        # Self-attention sublayer
        self.self_attention = SelfAttention(embed_dim, num_heads)
        # Feed-forward network weights
        self.W1 = np.random.randn(embed_dim, ff_dim) * 0.01
        self.b1 = np.zeros(ff_dim)
        self.W2 = np.random.randn(ff_dim, embed_dim) * 0.01
        self.b2 = np.zeros(embed_dim)
        # Layer-normalization parameters (scale and shift) for each sublayer
        self.gamma1 = np.ones((1, 1, embed_dim))
        self.beta1 = np.zeros((1, 1, embed_dim))
        self.gamma2 = np.ones((1, 1, embed_dim))
        self.beta2 = np.zeros((1, 1, embed_dim))

    def relu(self, x):
        return np.maximum(0, x)

    def layer_norm(self, x, gamma, beta, epsilon=1e-6):
        """Layer normalization."""
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        norm = (x - mean) / np.sqrt(var + epsilon)
        return gamma * norm + beta

    def dropout(self, x):
        """Inverted dropout (applied unconditionally here for simplicity)."""
        if self.dropout_rate > 0:
            mask = np.random.binomial(1, 1 - self.dropout_rate, x.shape)
            return x * mask / (1 - self.dropout_rate)
        return x

    def feed_forward(self, x):
        """Position-wise feed-forward network."""
        hidden = self.relu(np.matmul(x, self.W1) + self.b1)
        output = np.matmul(hidden, self.W2) + self.b2
        return output

    def forward(self, x, mask=None):
        """Forward pass."""
        # Multi-head self-attention + residual connection + layer norm
        attn_output, _ = self.self_attention.forward(x, mask)
        attn_output = self.dropout(attn_output)
        x1 = x + attn_output
        x1 = self.layer_norm(x1, self.gamma1, self.beta1)
        # Feed-forward network + residual connection + layer norm
        ff_output = self.feed_forward(x1)
        ff_output = self.dropout(ff_output)
        x2 = x1 + ff_output
        x2 = self.layer_norm(x2, self.gamma2, self.beta2)
        return x2

# Example usage
embed_dim = 512
num_heads = 8
ff_dim = 2048
batch_size = 4
seq_len = 50

# Input
x = np.random.randn(batch_size, seq_len, embed_dim)

# Transformer encoder layer
encoder_layer = TransformerEncoderLayer(embed_dim, num_heads, ff_dim)

# Forward pass
output = encoder_layer.forward(x)
print("Input shape:", x.shape)
print("Output shape:", output.shape)
```
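For comparison, deep learning frameworks ship this building block ready-made. A minimal sketch using PyTorch's `nn.TransformerEncoderLayer`, assuming PyTorch is installed:

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
out = layer(torch.randn(4, 50, 512))  # (batch, seq_len, d_model)
print(out.shape)
```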
## Hands-On Project: A Text Classification Model

Let's build a Transformer-based text classification model:
```python
class TransformerClassifier:
    def __init__(self, vocab_size, embed_dim, num_classes, num_layers=2, num_heads=8, ff_dim=2048):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.num_classes = num_classes
        self.num_layers = num_layers
        # Token embeddings
        self.embedding = np.random.randn(vocab_size, embed_dim) * 0.01
        # Positional encoding
        self.pos_encoding = PositionalEncoding(embed_dim)
        # Stack of Transformer encoder layers
        self.encoder_layers = [
            TransformerEncoderLayer(embed_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ]
        # Classification head
        self.classifier_weights = np.random.randn(embed_dim, num_classes) * 0.01
        self.classifier_bias = np.zeros(num_classes)

    def embed_and_positional(self, input_ids):
        """Token embedding + positional encoding."""
        # Look up token embeddings
        embeddings = self.embedding[input_ids]
        # Add positional encodings
        embeddings = self.pos_encoding.forward(embeddings)
        return embeddings

    def forward(self, input_ids, mask=None):
        """Forward pass."""
        # Token embedding + positional encoding
        x = self.embed_and_positional(input_ids)
        # Pass through the Transformer encoder layers
        for layer in self.encoder_layers:
            x = layer.forward(x, mask)
        # Pooling (use the representation of the first token)
        pooled = x[:, 0, :]
        # Classification
        logits = np.matmul(pooled, self.classifier_weights) + self.classifier_bias
        return logits

    def softmax(self, x):
        """Softmax."""
        x_max = np.max(x, axis=-1, keepdims=True)
        exp_x = np.exp(x - x_max)
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    def predict(self, input_ids):
        """Predict class labels."""
        logits = self.forward(input_ids)
        probabilities = self.softmax(logits)
        return np.argmax(probabilities, axis=-1)

# Text preprocessing
def preprocess_text(text, vocab, max_length=50):
    """Convert a text into a sequence of token IDs."""
    words = text.lower().split()
    # Truncate or pad to a fixed length
    if len(words) > max_length:
        words = words[:max_length]
    else:
        words += ['<PAD>'] * (max_length - len(words))
    # Map words to IDs
    token_ids = [vocab.get(word, vocab.get('<UNK>', 0)) for word in words]
    return np.array(token_ids)

# Example data
texts = [
    "this movie is amazing",
    "I love this film",
    "terrible movie and boring",
    "what a waste of time",
    "excellent acting and story",
    "worst movie ever"
]
labels = [1, 1, 0, 0, 1, 0]  # 1: positive, 0: negative

# Build the vocabulary
all_words = set()
for text in texts:
    all_words.update(text.lower().split())
vocab = {'<PAD>': 0, '<UNK>': 1}
vocab.update({word: idx + 2 for idx, word in enumerate(all_words)})
vocab_size = len(vocab)

# Preprocess the data
max_length = 20
X = np.array([preprocess_text(text, vocab, max_length) for text in texts])
y = np.array(labels)

# Create the model
model = TransformerClassifier(vocab_size, embed_dim=128, num_classes=2)

# A minimal "training" loop
learning_rate = 0.001
epochs = 100
for epoch in range(epochs):
    # Forward pass
    logits = model.forward(X)
    # Cross-entropy loss
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    loss = -np.mean(np.log(probabilities[np.arange(len(y)), y] + 1e-10))
    # Simplified: no parameters are updated here; a real implementation needs full backpropagation
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Test the (untrained) model; without real training the predictions are essentially random
test_texts = [
    "this is a great movie",
    "I hate this film",
    "not bad but could be better"
]
X_test = np.array([preprocess_text(text, vocab, max_length) for text in test_texts])
predictions = model.predict(X_test)
print("\nTest results:")
for text, pred in zip(test_texts, predictions):
    sentiment = "positive" if pred == 1 else "negative"
    print(f"Text: '{text}' -> Prediction: {sentiment}")
```
## Pretrained Language Models

### BERT-Style Pretraining

BERT (Bidirectional Encoder Representations from Transformers) introduced two pretraining tasks:

- Masked language modeling (MLM): predicting tokens that have been masked out
- Next sentence prediction (NSP): deciding whether two sentences appear consecutively
```python
class BERTStylePretraining:
    def __init__(self, transformer_model, vocab_size, mask_token_id):
        self.transformer = transformer_model
        self.vocab_size = vocab_size
        self.mask_token_id = mask_token_id
        # MLM prediction head
        self.mlm_weights = np.random.randn(transformer_model.embed_dim, vocab_size) * 0.01
        self.mlm_bias = np.zeros(vocab_size)
        # NSP prediction head
        self.nsp_weights = np.random.randn(transformer_model.embed_dim, 2) * 0.01
        self.nsp_bias = np.zeros(2)

    def mask_tokens(self, input_ids, mask_prob=0.15):
        """Randomly mask tokens for the MLM objective."""
        masked_input = input_ids.copy()
        mask_labels = np.full(input_ids.shape, -100)  # -100 means "ignore this position"
        for i in range(input_ids.shape[0]):
            for j in range(input_ids.shape[1]):
                if np.random.random() < mask_prob:
                    if np.random.random() < 0.8:
                        # 80%: replace with [MASK]
                        masked_input[i, j] = self.mask_token_id
                    elif np.random.random() < 0.5:
                        # 10%: replace with a random token
                        masked_input[i, j] = np.random.randint(0, self.vocab_size)
                    # remaining 10%: keep the original token
                    mask_labels[i, j] = input_ids[i, j]
        return masked_input, mask_labels

    def forward(self, input_ids, attention_mask=None):
        """Forward pass."""
        # Build an additive attention mask: attention_mask uses 1 = attend, 0 = ignore
        mask = None
        if attention_mask is not None:
            mask = (1 - attention_mask)[:, None, None, :]  # broadcasts over heads and query positions
        # Run the encoder stack to get per-token hidden states
        hidden = self.transformer.embed_and_positional(input_ids)
        for layer in self.transformer.encoder_layers:
            hidden = layer.forward(hidden, mask)
        # MLM predictions for every position
        mlm_logits = np.matmul(hidden, self.mlm_weights) + self.mlm_bias
        # NSP prediction from the [CLS] token (position 0)
        nsp_logits = np.matmul(hidden[:, 0, :], self.nsp_weights) + self.nsp_bias
        return mlm_logits, nsp_logits, hidden

# A small vocabulary with special tokens
special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
vocab = {token: idx for idx, token in enumerate(special_tokens)}
vocab.update({f'token_{i}': i + len(special_tokens) for i in range(1000)})

# A small backbone model (reusing the classifier's encoder stack)
transformer = TransformerClassifier(
    vocab_size=len(vocab),
    embed_dim=256,
    num_classes=2,
    num_layers=2
)

# Pretraining wrapper
pretraining_model = BERTStylePretraining(transformer, len(vocab), mask_token_id=vocab['[MASK]'])

# Simulated pretraining data
batch_size = 4
seq_len = 32
input_ids = np.random.randint(0, len(vocab), (batch_size, seq_len))
attention_mask = np.ones((batch_size, seq_len))

# Masked language modeling
masked_input, mask_labels = pretraining_model.mask_tokens(input_ids)

# Forward pass
mlm_logits, nsp_logits, outputs = pretraining_model.forward(masked_input, attention_mask)
print("MLM logits shape:", mlm_logits.shape)
print("NSP logits shape:", nsp_logits.shape)
```
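In practice one loads an already-pretrained BERT rather than pretraining from scratch. A minimal sketch of querying a masked language model through the Hugging Face `transformers` fill-mask pipeline, assuming the library is installed and the `bert-base-uncased` checkpoint can be downloaded:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The cat sat on the [MASK].")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))
```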
## Hands-On Project: A Sentiment Analysis System

Let's build a complete sentiment analysis system:
```python
class SentimentAnalyzer:
    def __init__(self, model_path=None):
        self.model = None
        self.vocab = None
        self.max_length = 128

    def load_data(self, file_path):
        """Load a sentiment analysis dataset."""
        texts = []
        labels = []
        # Simulated data; a real application would read from file_path
        sample_data = [
            ("I love this product!", 1),
            ("Terrible experience", 0),
            ("Amazing quality", 1),
            ("Waste of money", 0),
            ("Highly recommended", 1),
            ("Poor customer service", 0)
        ]
        for text, label in sample_data:
            texts.append(text)
            labels.append(label)
        return texts, labels

    def build_vocab(self, texts, min_freq=2):
        """Build the vocabulary."""
        word_counts = {}
        for text in texts:
            words = text.lower().split()
            for word in words:
                word_counts[word] = word_counts.get(word, 0) + 1
        # Keep only reasonably frequent words
        self.vocab = {'<PAD>': 0, '<UNK>': 1}
        idx = 2
        for word, count in word_counts.items():
            if count >= min_freq:
                self.vocab[word] = idx
                idx += 1

    def preprocess(self, text):
        """Preprocess a text into token IDs."""
        words = text.lower().split()
        tokens = []
        for word in words:
            if word in self.vocab:
                tokens.append(self.vocab[word])
            else:
                tokens.append(self.vocab['<UNK>'])
        # Pad or truncate
        if len(tokens) > self.max_length:
            tokens = tokens[:self.max_length]
        else:
            tokens += [self.vocab['<PAD>']] * (self.max_length - len(tokens))
        return tokens

    def train(self, texts, labels, epochs=100, learning_rate=0.001):
        """Train the model."""
        # Build the vocabulary
        self.build_vocab(texts)
        # Preprocess the data
        X = np.array([self.preprocess(text) for text in texts])
        y = np.array(labels)
        # Create the model
        self.model = TransformerClassifier(
            vocab_size=len(self.vocab),
            embed_dim=128,
            num_classes=2,
            num_layers=2
        )
        # Training loop (forward pass and metrics only; no backpropagation in this sketch)
        for epoch in range(epochs):
            # Forward pass
            logits = self.model.forward(X)
            # Cross-entropy loss
            exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
            probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
            loss = -np.mean(np.log(probabilities[np.arange(len(y)), y] + 1e-10))
            # Accuracy
            predictions = np.argmax(probabilities, axis=1)
            accuracy = np.mean(predictions == y)
            if epoch % 20 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}")

    def predict(self, text):
        """Predict the sentiment of a single text."""
        if self.model is None:
            raise ValueError("The model has not been trained yet")
        tokens = self.preprocess(text)
        input_ids = np.array([tokens])
        prediction = self.model.predict(input_ids)[0]
        confidence = self.model.softmax(self.model.forward(input_ids))[0, prediction]
        sentiment = "positive" if prediction == 1 else "negative"
        return sentiment, confidence

# Usage example
analyzer = SentimentAnalyzer()

# Load the data
texts, labels = analyzer.load_data("sentiment_data.csv")

# Train the model
print("Training the sentiment analysis model...")
analyzer.train(texts, labels, epochs=100)

# Test the model
test_texts = [
    "This product exceeded my expectations!",
    "I'm very disappointed with the service",
    "Average quality, nothing special",
    "Outstanding performance and great value"
]
print("\nTest results:")
for text in test_texts:
    sentiment, confidence = analyzer.predict(text)
    print(f"Text: '{text}'")
    print(f"Sentiment: {sentiment} (confidence: {confidence:.4f})\n")
```
## The Future of NLP

### 1. Multimodal Learning

Combining information from multiple modalities such as text, images, and audio:
```python
class MultiModalTransformer:
    def __init__(self, text_embed_dim, image_embed_dim, output_dim):
        self.text_encoder = TransformerClassifier(
            vocab_size=10000, embed_dim=text_embed_dim, num_classes=output_dim
        )
        # Simplified image encoder (a single linear projection)
        self.image_encoder_weights = np.random.randn(image_embed_dim, output_dim) * 0.01
        # Fusion layer
        self.fusion_weights = np.random.randn(output_dim * 2, output_dim) * 0.01

    def forward(self, text_input, image_features):
        """Multimodal forward pass."""
        # Text features (the classifier's output vector is reused as a text representation)
        text_features = self.text_encoder.forward(text_input)
        # Image features
        image_encoded = np.matmul(image_features, self.image_encoder_weights)
        # Feature fusion
        combined = np.concatenate([text_features, image_encoded], axis=-1)
        fused = np.matmul(combined, self.fusion_weights)
        return fused

# Example usage
text_input = np.random.randint(0, 10000, (4, 50))  # 4 samples, 50 tokens each
image_features = np.random.randn(4, 2048)          # 4 samples, 2048-dim image features

multimodal_model = MultiModalTransformer(
    text_embed_dim=256,
    image_embed_dim=2048,
    output_dim=128
)
output = multimodal_model.forward(text_input, image_features)
print("Multimodal output shape:", output.shape)
```
### 2. Few-Shot and Zero-Shot Learning

Using a pretrained model for few-shot learning:
```python
class FewShotClassifier:
    def __init__(self, pretrain_model):
        self.pretrain_model = pretrain_model
        self.support_examples = {}

    def add_support_example(self, label, text):
        """Add a labeled support example."""
        if label not in self.support_examples:
            self.support_examples[label] = []
        # Embed the text
        tokens = np.array([self.preprocess(text)])
        embedding = self.pretrain_model.embed_and_positional(tokens)
        # Mean-pool the token embeddings into a single text representation
        text_repr = np.mean(embedding, axis=1)
        self.support_examples[label].append(text_repr[0])

    def preprocess(self, text):
        """Very simplified preprocessing.

        Note: hash() is randomized between Python processes, so these token IDs
        are only consistent within a single run, which is enough for this demo.
        """
        words = text.lower().split()
        return [hash(word) % 10000 for word in words[:50]]

    def predict(self, text, k=3):
        """Predict by k-nearest-neighbor similarity to the support examples."""
        # Embed the query text
        tokens = np.array([self.preprocess(text)])
        embedding = self.pretrain_model.embed_and_positional(tokens)
        test_repr = np.mean(embedding, axis=1)[0]
        best_label = None
        best_score = -float('inf')
        # Compare against the support examples of each class
        for label, examples in self.support_examples.items():
            if len(examples) == 0:
                continue
            # Average cosine similarity to the k most similar support examples
            similarities = []
            for example in examples:
                similarity = np.dot(test_repr, example) / (
                    np.linalg.norm(test_repr) * np.linalg.norm(example) + 1e-8
                )
                similarities.append(similarity)
            similarities.sort(reverse=True)
            avg_similarity = np.mean(similarities[:k])
            if avg_similarity > best_score:
                best_score = avg_similarity
                best_label = label
        return best_label, best_score

# Usage example
# Pretend this model has been pretrained
pretrain_model = TransformerClassifier(vocab_size=10000, embed_dim=256, num_classes=10)
few_shot = FewShotClassifier(pretrain_model)

# Add support examples
few_shot.add_support_example("positive", "I love this movie!")
few_shot.add_support_example("positive", "Amazing film!")
few_shot.add_support_example("negative", "Terrible acting!")
few_shot.add_support_example("negative", "Worst movie ever!")

# Predict
text = "This is a fantastic film"
prediction, score = few_shot.predict(text)
print(f"Prediction: {prediction}, similarity score: {score:.4f}")
```
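Zero-shot classification goes one step further: a pretrained entailment-style model can score arbitrary candidate labels with no support examples at all. A minimal sketch using the Hugging Face `transformers` zero-shot pipeline, assuming the library is installed and a default checkpoint can be downloaded:

```python
from transformers import pipeline

zero_shot = pipeline("zero-shot-classification")
result = zero_shot("This is a fantastic film", candidate_labels=["positive", "negative"])
print(result["labels"][0], round(result["scores"][0], 3))
```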
## Summary

This article has taken a close look at the core concepts of natural language processing and the Transformer architecture, including:

- Traditional NLP methods: the bag-of-words model, TF-IDF, and related techniques
- Word embedding techniques: how Word2Vec works, with a toy implementation
- The Transformer architecture: self-attention, positional encoding, and the other core components
- Practical applications: complete walkthroughs of text classification and sentiment analysis projects
- Emerging directions: multimodal learning, few-shot learning, and other new frontiers

The arrival of the Transformer fundamentally changed NLP: its core innovation, self-attention, lets models process sequences in parallel while capturing long-range dependencies. With pretrained models such as BERT and GPT, the performance of NLP systems has improved dramatically.

Looking ahead, NLP will keep moving toward more capable and more general systems: larger models, lower training-data requirements, and stronger reasoning abilities. Understanding how the Transformer works, and how to implement it, is essential for making sense of and applying modern NLP.
## Further Learning Suggestions

- Study more advanced architecture variants (such as RoBERTa and DeBERTa)
- Explore fine-tuning techniques for large pretrained models
- Learn prompt engineering
- Look into model compression and optimization techniques
- Pay attention to ethics and bias issues
- Practice building end-to-end NLP projects