自然语言处理(NLP)与Transformer架构

引言

自然语言处理(Natural Language Processing, NLP)是人工智能领域中专注于计算机与人类语言交互的分支。随着深度学习技术的发展,NLP取得了突破性进展,特别是在2017年Transformer架构提出之后。本文将深入探讨NLP的核心概念、技术发展,以及Transformer架构的革命性影响。

NLP基础概念

什么是自然语言处理?

NLP是让计算机理解、解释和生成人类语言的技术。它包含两个主要方面:

  1. 自然语言理解(NLU):使计算机能够理解文本含义
  2. 自然语言生成(NLG):使计算机能够生成类人文本

NLP的核心任务

  1. 文本分类:将文本分配到预定义类别
  2. 命名实体识别(NER):识别文本中的实体
  3. 关系抽取:识别实体之间的关系
  4. 情感分析:判断文本的情感倾向
  5. 机器翻译:将文本从一种语言翻译成另一种
  6. 问答系统:根据问题提供答案
  7. 文本摘要:生成文本的简短摘要
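
上述任务如今大多可以直接用预训练模型完成。下面是一个基于 Hugging Face transformers 库 pipeline 接口的最小示意(属于补充示例,假设已安装 transformers,首次运行会自动下载各任务的默认模型):

python
from transformers import pipeline

# 情感分析:输出标签(POSITIVE/NEGATIVE)及其分数
sentiment = pipeline("sentiment-analysis")
print(sentiment("I love natural language processing!"))

# 命名实体识别:识别文本中的人名、地名、机构名等实体
ner = pipeline("ner")
print(ner("Hugging Face is based in New York City."))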

传统NLP方法

词袋模型(Bag of Words)

词袋模型是最简单的文本表示方法,忽略了词序信息:

python
import numpy as np
from collections import Counter

class BagOfWords:
    def __init__(self):
        self.vocabulary = {}
        self.vocabulary_size = 0

    def fit(self, documents):
        """构建词汇表"""
        word_counts = Counter()
        for doc in documents:
            words = doc.lower().split()
            word_counts.update(words)

        # 只保留最常见的词
        most_common = word_counts.most_common(5000)  # 保留5000个最常见词

        self.vocabulary = {word: idx for idx, (word, _) in enumerate(most_common)}
        self.vocabulary_size = len(self.vocabulary)

    def transform(self, documents):
        """将文档转换为词袋向量"""
        vectors = []
        for doc in documents:
            words = doc.lower().split()
            vector = np.zeros(self.vocabulary_size)

            word_count = Counter(words)
            for word, count in word_count.items():
                if word in self.vocabulary:
                    idx = self.vocabulary[word]
                    vector[idx] = count

            vectors.append(vector)

        return np.array(vectors)

# 示例使用
documents = [
    "I love machine learning",
    "Machine learning is fascinating",
    "I enjoy deep learning",
    "Deep learning is a subset of machine learning"
]

bow = BagOfWords()
bow.fit(documents)
vectors = bow.transform(documents)

print("词汇表大小:", bow.vocabulary_size)
print("文档向量形状:", vectors.shape)

TF-IDF(词频-逆文档频率)

TF-IDF改进了词袋模型,用权重衡量一个词对某篇文档的重要性:TF-IDF(t, d) = TF(t, d) × IDF(t),其中 TF(t, d) 是词 t 在文档 d 中的相对频率,IDF(t) = log(N / (1 + df(t))),N 为文档总数,df(t) 为包含词 t 的文档数(分母加1起平滑作用):

python
import math
import numpy as np
from collections import Counter

class TFIDF:
    def __init__(self):
        self.vocabulary = {}
        self.idf = {}
        self.vocabulary_size = 0
        self.document_count = 0

    def fit(self, documents):
        """计算IDF值"""
        self.document_count = len(documents)
        word_document_counts = {}
        all_words = set()

        for doc in documents:
            words = set(doc.lower().split())
            all_words.update(words)
            for word in words:
                word_document_counts[word] = word_document_counts.get(word, 0) + 1

        # 构建词汇表和IDF
        self.vocabulary = {word: idx for idx, word in enumerate(all_words)}
        self.vocabulary_size = len(self.vocabulary)

        # 计算IDF
        for word, doc_count in word_document_counts.items():
            self.idf[word] = math.log(self.document_count / (1 + doc_count))

    def transform(self, documents):
        """将文档转换为TF-IDF向量"""
        vectors = []

        for doc in documents:
            words = doc.lower().split()
            word_count = Counter(words)
            total_words = len(words)

            vector = np.zeros(self.vocabulary_size)

            for word, count in word_count.items():
                if word in self.vocabulary:
                    # 计算TF
                    tf = count / total_words
                    # 计算TF-IDF
                    idx = self.vocabulary[word]
                    vector[idx] = tf * self.idf.get(word, 0)

            vectors.append(vector)

        return np.array(vectors)

# 示例使用
tfidf = TFIDF()
tfidf.fit(documents)
tfidf_vectors = tfidf.transform(documents)

print("TF-IDF向量形状:", tfidf_vectors.shape)

词嵌入:词向量的演进

Word2Vec

Word2Vec通过上下文学习词的分布式表示,常见训练方式有Skip-gram(用中心词预测上下文)和CBOW(用上下文预测中心词)。下面是一个极简的Skip-gram风格实现,只使用正样本、没有负采样,仅用于演示基本思路:

python
import numpy as np
from collections import defaultdict

class Word2Vec:
    def __init__(self, vector_size=100, window=5, learning_rate=0.025, epochs=100):
        self.vector_size = vector_size
        self.window = window
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.word_vectors = {}
        self.vocab = set()

    def build_vocab(self, sentences):
        """构建词汇表"""
        word_counts = defaultdict(int)
        for sentence in sentences:
            for word in sentence.split():
                word_counts[word.lower()] += 1

        # 过滤低频词(演示语料很小,这里阈值设为1;大语料可适当调高)
        self.vocab = {word for word, count in word_counts.items() if count >= 1}

        # 初始化词向量
        for word in self.vocab:
            self.word_vectors[word] = np.random.uniform(-0.5, 0.5, self.vector_size)

    def sigmoid(self, x):
        """Sigmoid激活函数"""
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def train(self, sentences):
        """训练Word2Vec模型"""
        for epoch in range(self.epochs):
            total_loss = 0

            for sentence in sentences:
                words = [word.lower() for word in sentence.split() if word.lower() in self.vocab]

                for i, target_word in enumerate(words):
                    # 获取上下文词
                    start = max(0, i - self.window)
                    end = min(len(words), i + self.window + 1)
                    context_words = [words[j] for j in range(start, end) if j != i]

                    # 更新词向量
                    for context_word in context_words:
                        # Skip-gram风格更新:先取向量副本,避免原地更新互相影响
                        target_vector = self.word_vectors[target_word].copy()
                        context_vector = self.word_vectors[context_word].copy()

                        # 计算相似度和损失
                        dot_product = np.dot(target_vector, context_vector)
                        probability = self.sigmoid(dot_product)
                        loss = -np.log(probability + 1e-10)

                        # 更新向量
                        gradient = probability - 1
                        self.word_vectors[target_word] -= self.learning_rate * gradient * context_vector
                        self.word_vectors[context_word] -= self.learning_rate * gradient * target_vector

                        total_loss += loss

            if epoch % 10 == 0:
                print(f"Epoch {epoch}, Average Loss: {total_loss / len(sentences):.4f}")

    def get_vector(self, word):
        """获取词向量"""
        return self.word_vectors.get(word.lower(), None)

    def similarity(self, word1, word2):
        """计算词相似度"""
        vec1 = self.get_vector(word1)
        vec2 = self.get_vector(word2)

        if vec1 is None or vec2 is None:
            return 0

        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)

        return dot_product / (norm1 * norm2)

# 示例训练数据
sentences = [
    "the cat sits on the mat",
    "the dog plays in the garden",
    "cats and dogs are pets",
    "the cat is sleeping",
    "dogs love to play",
    "mat is comfortable",
    "garden is beautiful"
]

# 训练Word2Vec
w2v = Word2Vec(vector_size=50, epochs=100)
w2v.build_vocab(sentences)
w2v.train(sentences)

# 测试词相似度
print("'cat'和'dog'的相似度:", w2v.similarity('cat', 'dog'))
print("'cat'和'mat'的相似度:", w2v.similarity('cat', 'mat'))

Transformer架构详解

Transformer的革命性

2017年,Google在论文《Attention Is All You Need》中提出了Transformer架构,彻底改变了NLP领域。Transformer的核心创新是自注意力机制:模型完全依靠注意力来建模序列内部的依赖关系,摒弃了传统的循环和卷积结构,因此可以并行处理整个序列。

自注意力机制

自注意力机制允许模型在处理序列中某个位置时,同时关注序列中其他位置的信息。其核心计算是缩放点积注意力:Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V,其中 d_k 是每个注意力头的维度:

python
import numpy as np

class SelfAttention:
    def __init__(self, embed_dim, num_heads=8):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # 初始化权重矩阵
        self.W_q = np.random.randn(embed_dim, embed_dim) * 0.01
        self.W_k = np.random.randn(embed_dim, embed_dim) * 0.01
        self.W_v = np.random.randn(embed_dim, embed_dim) * 0.01
        self.W_o = np.random.randn(embed_dim, embed_dim) * 0.01

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """缩放点积注意力"""
        # 计算注意力分数;Q, K形状为(batch, num_heads, seq_len, head_dim),需交换K的最后两维
        scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(self.head_dim)

        # 应用mask(如果提供)
        if mask is not None:
            scores += mask * -1e9

        # 计算注意力权重
        attention_weights = self.softmax(scores, axis=-1)

        # 应用权重到值向量
        output = np.matmul(attention_weights, V)

        return output, attention_weights

    def softmax(self, x, axis=-1):
        """稳定的softmax实现"""
        x_max = np.max(x, axis=axis, keepdims=True)
        exp_x = np.exp(x - x_max)
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

    def forward(self, x, mask=None):
        """前向传播"""
        batch_size, seq_len, embed_dim = x.shape

        # 生成Q, K, V
        Q = np.matmul(x, self.W_q)
        K = np.matmul(x, self.W_k)
        V = np.matmul(x, self.W_v)

        # 重塑为多头
        Q = Q.reshape(batch_size, seq_len, self.num_heads, self.head_dim)
        K = K.reshape(batch_size, seq_len, self.num_heads, self.head_dim)
        V = V.reshape(batch_size, seq_len, self.num_heads, self.head_dim)

        # 转置维度以便并行计算
        Q = Q.transpose(0, 2, 1, 3)
        K = K.transpose(0, 2, 1, 3)
        V = V.transpose(0, 2, 1, 3)

        # 计算注意力
        attention_output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # 合并多头
        attention_output = attention_output.transpose(0, 2, 1, 3)
        attention_output = attention_output.reshape(batch_size, seq_len, embed_dim)

        # 输出投影
        output = np.matmul(attention_output, self.W_o)

        return output, attention_weights

# 示例使用
embed_dim = 512
num_heads = 8
batch_size = 2
seq_len = 10

# 创建输入(batch_size × seq_len × embed_dim)
x = np.random.randn(batch_size, seq_len, embed_dim)

# 创建自注意力层
attention = SelfAttention(embed_dim, num_heads)

# 前向传播
output, weights = attention.forward(x)

print("输入形状:", x.shape)
print("输出形状:", output.shape)
print("注意力权重形状:", weights.shape)

位置编码

由于Transformer不包含循环结构,本身对词序不敏感,需要位置编码来提供位置信息。原论文采用正余弦位置编码,即 PE(pos, 2i) = sin(pos / 10000^(2i/d_model))、PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),实现如下:

python
class PositionalEncoding:
    def __init__(self, embed_dim, max_seq_len=5000):
        self.embed_dim = embed_dim
        self.max_seq_len = max_seq_len

        # 创建位置编码矩阵
        position = np.arange(max_seq_len)[:, np.newaxis]
        div_term = np.exp(np.arange(0, embed_dim, 2) * -(np.log(10000.0) / embed_dim))

        pe = np.zeros((max_seq_len, embed_dim))
        pe[:, 0::2] = np.sin(position * div_term)
        pe[:, 1::2] = np.cos(position * div_term)

        self.pe = pe

    def forward(self, x):
        """添加位置编码到输入嵌入"""
        seq_len = x.shape[1]
        return x + self.pe[:seq_len]

# 示例使用
embed_dim = 512
max_seq_len = 100
batch_size = 4
seq_len = 50

# 创建输入嵌入
x = np.random.randn(batch_size, seq_len, embed_dim)

# 添加位置编码
pos_encoding = PositionalEncoding(embed_dim, max_seq_len)
x_with_pos = pos_encoding.forward(x)

print("原始输入形状:", x.shape)
print("添加位置编码后形状:", x_with_pos.shape)

完整的Transformer编码器层

python
class TransformerEncoderLayer:
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate=0.1):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.ff_dim = ff_dim
        self.dropout_rate = dropout_rate

        # 自注意力层
        self.self_attention = SelfAttention(embed_dim, num_heads)

        # 前馈网络权重
        self.W1 = np.random.randn(embed_dim, ff_dim) * 0.01
        self.b1 = np.zeros(ff_dim)
        self.W2 = np.random.randn(ff_dim, embed_dim) * 0.01
        self.b2 = np.zeros(embed_dim)

        # 层归一化参数(可学习的缩放gamma与平移beta)
        self.gamma1 = np.ones(embed_dim)
        self.beta1 = np.zeros(embed_dim)
        self.gamma2 = np.ones(embed_dim)
        self.beta2 = np.zeros(embed_dim)

    def relu(self, x):
        return np.maximum(0, x)

    def layer_norm(self, x, gamma, beta, epsilon=1e-6):
        """层归一化:在最后一维上做标准化,再用gamma/beta缩放平移"""
        mean = np.mean(x, axis=-1, keepdims=True)
        var = np.var(x, axis=-1, keepdims=True)
        norm = (x - mean) / np.sqrt(var + epsilon)
        return gamma * norm + beta

    def dropout(self, x):
        """Dropout层"""
        if self.dropout_rate > 0:
            mask = np.random.binomial(1, 1 - self.dropout_rate, x.shape)
            return x * mask / (1 - self.dropout_rate)
        return x

    def feed_forward(self, x):
        """位置前馈网络"""
        hidden = self.relu(np.matmul(x, self.W1) + self.b1)
        output = np.matmul(hidden, self.W2) + self.b2
        return output

    def forward(self, x, mask=None):
        """前向传播"""
        # 多头自注意力 + 残差连接 + 层归一化
        attn_output, _ = self.self_attention.forward(x, mask)
        attn_output = self.dropout(attn_output)
        x1 = x + attn_output
        x1 = self.layer_norm(x1, self.gamma1, self.beta1)

        # 前馈网络 + 残差连接 + 层归一化
        ff_output = self.feed_forward(x1)
        ff_output = self.dropout(ff_output)
        x2 = x1 + ff_output
        x2 = self.layer_norm(x2, self.gamma2, self.beta2)

        return x2

# 示例使用
embed_dim = 512
num_heads = 8
ff_dim = 2048
batch_size = 4
seq_len = 50

# 创建输入
x = np.random.randn(batch_size, seq_len, embed_dim)

# 创建Transformer编码器层
encoder_layer = TransformerEncoderLayer(embed_dim, num_heads, ff_dim)

# 前向传播
output = encoder_layer.forward(x)

print("输入形状:", x.shape)
print("输出形状:", output.shape)

实战项目:文本分类模型

让我们构建一个基于Transformer的文本分类模型:

python
class TransformerClassifier:
    def __init__(self, vocab_size, embed_dim, num_classes, num_layers=2, num_heads=8, ff_dim=2048):
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.num_classes = num_classes
        self.num_layers = num_layers

        # 词嵌入
        self.embedding = np.random.randn(vocab_size, embed_dim) * 0.01

        # 位置编码
        self.pos_encoding = PositionalEncoding(embed_dim)

        # Transformer编码器层
        self.encoder_layers = [
            TransformerEncoderLayer(embed_dim, num_heads, ff_dim)
            for _ in range(num_layers)
        ]

        # 分类头
        self.classifier_weights = np.random.randn(embed_dim, num_classes) * 0.01
        self.classifier_bias = np.zeros(num_classes)

    def embed_and_positional(self, input_ids):
        """词嵌入 + 位置编码"""
        # 获取词嵌入
        embeddings = self.embedding[input_ids]

        # 添加位置编码
        embeddings = self.pos_encoding.forward(embeddings)

        return embeddings

    def forward(self, input_ids, mask=None):
        """前向传播"""
        # 词嵌入 + 位置编码
        x = self.embed_and_positional(input_ids)

        # 通过Transformer编码器层
        for layer in self.encoder_layers:
            x = layer.forward(x, mask)

        # 池化(使用第一个token的表示)
        pooled = x[:, 0, :]

        # 分类
        logits = np.matmul(pooled, self.classifier_weights) + self.classifier_bias

        return logits

    def softmax(self, x):
        """Softmax函数"""
        x_max = np.max(x, axis=-1, keepdims=True)
        exp_x = np.exp(x - x_max)
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    def predict(self, input_ids):
        """预测类别"""
        logits = self.forward(input_ids)
        probabilities = self.softmax(logits)
        return np.argmax(probabilities, axis=-1)

# 文本预处理
def preprocess_text(text, vocab, max_length=50):
    """将文本转换为token ID"""
    words = text.lower().split()
    # 截断或填充到固定长度
    if len(words) > max_length:
        words = words[:max_length]
    else:
        words += ['<PAD>'] * (max_length - len(words))

    # 转换为ID
    token_ids = [vocab.get(word, vocab.get('<UNK>', 0)) for word in words]
    return np.array(token_ids)

# 示例数据
texts = [
    "this movie is amazing",
    "I love this film",
    "terrible movie and boring",
    "what a waste of time",
    "excellent acting and story",
    "worst movie ever"
]

labels = [1, 1, 0, 0, 1, 0]  # 1: 正面, 0: 负面

# 构建词汇表
all_words = set()
for text in texts:
    all_words.update(text.lower().split())

vocab = {'<PAD>': 0, '<UNK>': 1}
vocab.update({word: idx+2 for idx, word in enumerate(all_words)})
vocab_size = len(vocab)

# 预处理数据
max_length = 20
X = np.array([preprocess_text(text, vocab, max_length) for text in texts])
y = np.array(labels)

# 创建模型
model = TransformerClassifier(vocab_size, embed_dim=128, num_classes=2)

# 简单训练循环
learning_rate = 0.001
epochs = 100

for epoch in range(epochs):
    # 前向传播
    logits = model.forward(X)

    # 计算损失(交叉熵)
    exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
    loss = -np.mean(np.log(probabilities[np.arange(len(y)), y] + 1e-10))

    # 这里省略了参数更新:完整实现需要反向传播计算梯度(后文给出一个基于PyTorch的可训练版本)
    if epoch % 20 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

# 测试模型
test_texts = [
    "this is a great movie",
    "I hate this film",
    "not bad but could be better"
]

X_test = np.array([preprocess_text(text, vocab, max_length) for text in test_texts])
predictions = model.predict(X_test)

print("\n测试结果:")
for text, pred in zip(test_texts, predictions):
    sentiment = "正面" if pred == 1 else "负面"
    print(f"文本: '{text}' -> 预测: {sentiment}")

预训练语言模型

BERT风格的预训练

BERT(Bidirectional Encoder Representations from Transformers)引入了两个预训练任务:

  1. 掩码语言模型(MLM):预测被掩码的词
  2. 下一句预测(NSP):判断两个句子是否连续

python
class BERTStylePretraining:
    def __init__(self, transformer_model, vocab_size):
        self.transformer = transformer_model
        self.vocab_size = vocab_size

        # MLM预测头
        self.mlm_weights = np.random.randn(transformer_model.embed_dim, vocab_size) * 0.01
        self.mlm_bias = np.zeros(vocab_size)

        # NSP预测头
        self.nsp_weights = np.random.randn(transformer_model.embed_dim, 2) * 0.01
        self.nsp_bias = np.zeros(2)

    def mask_tokens(self, input_ids, mask_prob=0.15):
        """随机掩码token用于MLM任务"""
        masked_input = input_ids.copy()
        mask_labels = np.full(input_ids.shape, -100)  # -100表示忽略该位置

        for i in range(input_ids.shape[0]):
            for j in range(input_ids.shape[1]):
                if np.random.random() < mask_prob:
                    # 80%概率用[MASK]替换
                    if np.random.random() < 0.8:
                        # [MASK]的ID依赖模块级vocab,这里做了简化处理
                        masked_input[i, j] = vocab.get('[MASK]', self.vocab_size - 1)
                    # 10%概率用随机词替换
                    elif np.random.random() < 0.5:
                        masked_input[i, j] = np.random.randint(0, self.vocab_size)
                    # 10%概率保持原词

                    mask_labels[i, j] = input_ids[i, j]

        return masked_input, mask_labels

    def forward(self, input_ids, attention_mask=None):
        """前向传播"""
        # TransformerClassifier.forward只返回分类logits,
        # 这里复用其嵌入层与编码器层,取得逐token的隐藏状态
        attn_mask = None
        if attention_mask is not None:
            # 将(batch, seq)的padding mask转成加性注意力mask(值为1的位置会被屏蔽)
            attn_mask = (1 - attention_mask)[:, None, None, :]

        hidden_states = self.transformer.embed_and_positional(input_ids)
        for layer in self.transformer.encoder_layers:
            hidden_states = layer.forward(hidden_states, attn_mask)

        # MLM预测:对每个位置预测其在词表上的分布
        mlm_logits = np.matmul(hidden_states, self.mlm_weights) + self.mlm_bias

        # NSP预测(使用序列首位[CLS] token的表示)
        nsp_logits = np.matmul(hidden_states[:, 0, :], self.nsp_weights) + self.nsp_bias

        return mlm_logits, nsp_logits, hidden_states

# 创建简单的词汇表
special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
vocab = {token: idx for idx, token in enumerate(special_tokens)}
vocab.update({f'token_{i}': i + len(special_tokens) for i in range(1000)})

# 创建模型(简化版)
transformer = TransformerClassifier(
    vocab_size=len(vocab),
    embed_dim=256,
    num_classes=2,
    num_layers=2
)

# 创建预训练模型
pretraining_model = BERTStylePretraining(transformer, len(vocab))

# 模拟预训练数据
batch_size = 4
seq_len = 32
input_ids = np.random.randint(0, len(vocab), (batch_size, seq_len))
attention_mask = np.ones((batch_size, seq_len))

# 掩码语言模型
masked_input, mask_labels = pretraining_model.mask_tokens(input_ids)

# 前向传播
mlm_logits, nsp_logits, outputs = pretraining_model.forward(masked_input, attention_mask)

print("MLM logits形状:", mlm_logits.shape)
print("NSP logits形状:", nsp_logits.shape)

实战项目:情感分析系统

让我们构建一个完整的情感分析系统:

python
class SentimentAnalyzer:
    def __init__(self, model_path=None):
        self.model = None
        self.vocab = None
        self.max_length = 128

    def load_data(self, file_path):
        """加载情感分析数据集"""
        texts = []
        labels = []

        # 模拟数据加载
        # 实际应用中应从文件加载
        sample_data = [
            ("I love this product!", 1),
            ("Terrible experience", 0),
            ("Amazing quality", 1),
            ("Waste of money", 0),
            ("Highly recommended", 1),
            ("Poor customer service", 0)
        ]

        for text, label in sample_data:
            texts.append(text)
            labels.append(label)

        return texts, labels

    def build_vocab(self, texts, min_freq=2):
        """构建词汇表"""
        word_counts = {}

        for text in texts:
            words = text.lower().split()
            for word in words:
                word_counts[word] = word_counts.get(word, 0) + 1

        # 保留出现频率高的词
        self.vocab = {'<PAD>': 0, '<UNK>': 1}
        idx = 2
        for word, count in word_counts.items():
            if count >= min_freq:
                self.vocab[word] = idx
                idx += 1

    def preprocess(self, text):
        """文本预处理"""
        words = text.lower().split()
        tokens = []

        for word in words:
            if word in self.vocab:
                tokens.append(self.vocab[word])
            else:
                tokens.append(self.vocab['<UNK>'])

        # 填充或截断
        if len(tokens) > self.max_length:
            tokens = tokens[:self.max_length]
        else:
            tokens += [self.vocab['<PAD>']] * (self.max_length - len(tokens))

        return tokens

    def train(self, texts, labels, epochs=100, learning_rate=0.001):
        """训练模型"""
        # 构建词汇表
        self.build_vocab(texts)

        # 预处理数据
        X = np.array([self.preprocess(text) for text in texts])
        y = np.array(labels)

        # 创建模型
        self.model = TransformerClassifier(
            vocab_size=len(self.vocab),
            embed_dim=128,
            num_classes=2,
            num_layers=2
        )

        # 训练循环
        for epoch in range(epochs):
            # 前向传播
            logits = self.model.forward(X)

            # 计算损失
            exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
            probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
            loss = -np.mean(np.log(probabilities[np.arange(len(y)), y] + 1e-10))

            # 计算准确率
            predictions = np.argmax(probabilities, axis=1)
            accuracy = np.mean(predictions == y)

            if epoch % 20 == 0:
                print(f"Epoch {epoch}: Loss = {loss:.4f}, Accuracy = {accuracy:.4f}")

    def predict(self, text):
        """预测单个文本的情感"""
        if self.model is None:
            raise ValueError("模型尚未训练")

        tokens = self.preprocess(text)
        input_ids = np.array([tokens])

        prediction = self.model.predict(input_ids)[0]
        confidence = self.model.softmax(self.model.forward(input_ids))[0, prediction]

        sentiment = "正面" if prediction == 1 else "负面"

        return sentiment, confidence

# 使用示例
analyzer = SentimentAnalyzer()

# 加载数据
texts, labels = analyzer.load_data("sentiment_data.csv")

# 训练模型
print("开始训练情感分析模型...")
analyzer.train(texts, labels, epochs=100)

# 测试模型
test_texts = [
    "This product exceeded my expectations!",
    "I'm very disappointed with the service",
    "Average quality, nothing special",
    "Outstanding performance and great value"
]

print("\n测试结果:")
for text in test_texts:
    sentiment, confidence = analyzer.predict(text)
    print(f"文本: '{text}'")
    print(f"情感: {sentiment} (置信度: {confidence:.4f})\n")

NLP的未来发展

1. 多模态学习

结合文本、图像、音频等多种模态的信息:

python
class MultiModalTransformer:
    def __init__(self, text_embed_dim, image_embed_dim, output_dim):
        self.text_encoder = TransformerClassifier(
            vocab_size=10000, embed_dim=text_embed_dim, num_classes=output_dim
        )

        # 简化的图像编码器
        self.image_encoder_weights = np.random.randn(image_embed_dim, output_dim) * 0.01

        # 融合层
        self.fusion_weights = np.random.randn(output_dim * 2, output_dim) * 0.01

    def forward(self, text_input, image_features):
        """多模态前向传播"""
        # 文本特征
        text_features = self.text_encoder.forward(text_input)

        # 图像特征
        image_encoded = np.matmul(image_features, self.image_encoder_weights)

        # 特征融合
        combined = np.concatenate([text_features, image_encoded], axis=-1)
        fused = np.matmul(combined, self.fusion_weights)

        return fused

# 示例使用
text_input = np.random.randint(0, 10000, (4, 50))  # 4个样本,50个token
image_features = np.random.randn(4, 2048)  # 4个样本,2048维图像特征

multimodal_model = MultiModalTransformer(
    text_embed_dim=256,
    image_embed_dim=2048,
    output_dim=128
)

output = multimodal_model.forward(text_input, image_features)
print("多模态输出形状:", output.shape)

2. 少样本和零样本学习

利用预训练模型进行少样本学习:

python
class FewShotClassifier:
    def __init__(self, pretrain_model):
        self.pretrain_model = pretrain_model
        self.support_examples = {}

    def add_support_example(self, label, text):
        """添加支持示例"""
        if label not in self.support_examples:
            self.support_examples[label] = []

        # 获取文本表示
        tokens = np.array([self.preprocess(text)])
        embedding = self.pretrain_model.embed_and_positional(tokens)
        # 池化得到文本表示
        text_repr = np.mean(embedding, axis=1)

        self.support_examples[label].append(text_repr[0])

    def preprocess(self, text):
        """文本预处理"""
        # 简化的预处理
        words = text.lower().split()
        return [hash(word) % 10000 for word in words[:50]]

    def predict(self, text, k=3):
        """使用k-近邻进行预测"""
        # 获取测试文本表示
        tokens = np.array([self.preprocess(text)])
        embedding = self.pretrain_model.embed_and_positional(tokens)
        test_repr = np.mean(embedding, axis=1)[0]

        best_label = None
        best_score = -float('inf')

        # 计算与各类别支持示例的相似度
        for label, examples in self.support_examples.items():
            if len(examples) == 0:
                continue

            # 计算与k个最近邻的平均相似度
            similarities = []
            for example in examples:
                similarity = np.dot(test_repr, example) / (
                    np.linalg.norm(test_repr) * np.linalg.norm(example) + 1e-8
                )
                similarities.append(similarity)

            similarities.sort(reverse=True)
            avg_similarity = np.mean(similarities[:k])

            if avg_similarity > best_score:
                best_score = avg_similarity
                best_label = label

        return best_label, best_score

# 使用示例
# 假设有一个预训练模型
pretrain_model = TransformerClassifier(vocab_size=10000, embed_dim=256, num_classes=10)
few_shot = FewShotClassifier(pretrain_model)

# 添加支持示例
few_shot.add_support_example("positive", "I love this movie!")
few_shot.add_support_example("positive", "Amazing film!")
few_shot.add_support_example("negative", "Terrible acting!")
few_shot.add_support_example("negative", "Worst movie ever!")

# 预测
text = "This is a fantastic film"
prediction, score = few_shot.predict(text)
print(f"预测: {prediction}, 相似度得分: {score:.4f}")

总结

本文深入探讨了自然语言处理和Transformer架构的核心概念,包括:

  1. 传统NLP方法:词袋模型、TF-IDF等
  2. 词嵌入技术:Word2Vec的实现原理
  3. Transformer架构:自注意力机制、位置编码等核心组件
  4. 实际应用:文本分类、情感分析等项目的完整实现
  5. 前沿发展:多模态学习、少样本学习等新方向

Transformer架构的出现彻底改变了NLP领域,其核心创新(自注意力机制)使得模型能够并行处理序列并捕捉长距离依赖关系。随着BERT、GPT等预训练模型的出现,NLP系统的性能得到了显著提升。

未来,NLP技术将继续向更智能、更通用的方向发展,包括更大的模型规模、更少的训练数据需求、更强的推理能力等。掌握Transformer架构的原理和实现,对于深入理解和应用现代NLP技术至关重要。

延伸学习建议

  1. 研究更先进的架构变体(如DeBERTa、RoBERTa等)
  2. 探索大规模预训练模型的微调技术
  3. 学习提示工程(Prompt Engineering)
  4. 了解模型压缩和优化技术
  5. 关注伦理和偏见问题
  6. 实践端到端的NLP项目开发