# NLP基础概念完整指南
## 总目录

- [第四章 大语言模型](#第四章-大语言模型)
- [第五章 动手搭建大模型](#第五章-动手搭建大模型)
- [第六章 大模型训练实践](#第六章-大模型训练实践)
- [第七章 大模型应用](#第七章-大模型应用)
## 章节目录

- [NLP 基础概念](#nlp-基础概念)
- [1.1 什么是 NLP](#11-什么是-nlp)
- [1.2 NLP 发展历程](#12-nlp-发展历程)
- [1.3 NLP 任务](#13-nlp-任务)
- [1.3.1 中文分词](#131-中文分词)
- [1.3.2 子词切分](#132-子词切分)
- [1.3.3 词性标注](#133-词性标注)
- [1.3.4 文本分类](#134-文本分类)
- [1.3.5 实体识别](#135-实体识别)
- [1.3.6 关系抽取](#136-关系抽取)
- [1.3.7 文本摘要](#137-文本摘要)
- [1.3.8 机器翻译](#138-机器翻译)
- [1.3.9 自动问答](#139-自动问答)
- [1.4 文本表示的发展历程](#14-文本表示的发展历程)
- [1.4.1 词向量](#141-词向量)
- [1.4.2 语言模型](#142-语言模型)
- [1.4.3 Word2Vec](#143-word2vec)
- [1.4.4 ELMo](#144-elmo)
## 1.3 NLP 任务
在NLP的技术体系中,各种具体任务构成了应用的基石。这些任务从文本的基础处理延伸到复杂的语义理解和内容生成,涵盖了语言处理的各个层面。每项任务都有其独特的技术挑战、评估指标和应用场景,它们相互关联、层层递进,共同构建了完整的NLP技术栈。
### 1.3.1 中文分词
中文分词(Chinese Word Segmentation, CWS)是中文自然语言处理的基础任务,其核心挑战在于中文文本没有天然的词边界标识符(如英文的空格),需要通过算法来识别和切分有意义的词汇单元。
技术挑战与难点:
- 切分歧义:同一字符序列可能有多种合理的切分方式
- 未登录词识别:新词、专有名词、外来词等未在词典中出现的词汇
- 上下文依赖:词汇边界往往依赖于上下文语义信息
主流技术方案:
- 基于词典的最大匹配算法:正向/反向/双向最大匹配
- 基于统计的序列标注:HMM、CRF等模型将分词转化为字符级标注问题
- 基于神经网络的端到端学习:BiLSTM-CRF、Transformer等深度模型
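在进入神经网络模型之前,先给出基于词典的正向最大匹配(FMM)的一个最小示意实现,用来说明上面提到的词典匹配思路(其中的示例词典为假设数据):

```python
# 正向最大匹配(FMM)最小示意:从左向右扫描,每次取不超过 max_len 的最长词典词
def fmm_segment(text, word_dict, max_len=4):
    words = []
    i = 0
    while i < len(text):
        matched = None
        # 从最大长度开始尝试匹配,匹配不到则退化为单字
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in word_dict or size == 1:
                matched = candidate
                break
        words.append(matched)
        i += len(matched)
    return words

word_dict = {"雍和宫", "荷花", "很好"}  # 假设的示例词典
print(fmm_segment("雍和宫的荷花开的很好", word_dict))
# ['雍和宫', '的', '荷花', '开', '的', '很好']
```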
```python
# 基于BiLSTM-CRF的中文分词系统伪代码
class ChineseWordSegmenter:
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags):
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.bilstm = nn.LSTM(embedding_dim, hidden_dim,
bidirectional=True, batch_first=True)
self.hidden2tag = nn.Linear(hidden_dim * 2, num_tags)
self.crf = CRF(num_tags)
# 标注体系: B-词首 I-词中 E-词尾 S-单字词
self.tag2idx = {'B': 0, 'I': 1, 'E': 2, 'S': 3}
self.idx2tag = {v: k for k, v in self.tag2idx.items()}
def forward(self, sentences, tags=None):
embeddings = self.embedding(sentences)
lstm_out, _ = self.bilstm(embeddings)
emissions = self.hidden2tag(lstm_out)
if tags is not None:
# 训练阶段
loss = -self.crf.log_likelihood(emissions, tags)
return loss
else:
# 预测阶段
best_paths = self.crf.viterbi_decode(emissions)
return best_paths
def segment(self, text):
char_ids = [self.char2idx.get(char, self.char2idx['<UNK>'])
for char in text]
char_tensor = torch.LongTensor([char_ids])
tag_ids = self.forward(char_tensor)[0]
tag_sequence = [self.idx2tag[i] for i in tag_ids]  # 将标签索引还原为BIES标签
# 根据标注序列重建分词结果
words = []
current_word = ""
for char, tag in zip(text, tag_sequence):
if tag == 'B' or tag == 'S':
if current_word:
words.append(current_word)
current_word = char
else: # 'I' or 'E'
current_word += char
if tag == 'E' or tag == 'S':
words.append(current_word)
current_word = ""
return words
# 实际使用示例
segmenter = ChineseWordSegmenter(vocab_size=5000, embedding_dim=128,
hidden_dim=256, num_tags=4)
text = "雍和宫的荷花开的很好"
result = segmenter.segment(text)
print(f"输入: {text}")
print(f"分词结果: {result}")
# 输出: ['雍和宫', '的', '荷花', '开', '的', '很', '好']
```

### 1.3.2 子词切分
子词切分(Subword Segmentation)技术旨在将词汇进一步分解为更细粒度的语义单元,有效解决词汇稀疏性和未登录词问题。这项技术在现代预训练语言模型中发挥着关键作用,使模型能够处理开放词汇表场景。
核心优势:
- 缓解数据稀疏性:通过子词单元减少词汇表规模
- 处理未登录词:将未见词汇分解为已知子词组合
- 跨语言一致性:为不同语言提供统一的文本表示方案
主流算法对比:
| 算法 | 基本思想 | 优势 | 局限性 |
| --- | --- | --- | --- |
| BPE | 贪心合并高频字符对 | 简单高效,可控词汇表大小 | 可能破坏语义完整性 |
| WordPiece | 基于似然最大化的合并策略 | 保持语义一致性 | 计算复杂度较高 |
| Unigram | 期望最大化算法 | 提供多种切分方案 | 实现复杂 |
| SentencePiece | 端到端无需预分词 | 语言无关性 | 配置参数较多 |
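作为直观对比,可以直接调用现成的 WordPiece 分词器观察未登录词如何被切分为子词。以下示例假设已安装 transformers,具体切分结果取决于所用词表,注释中的输出仅为示意:

```python
# 用现成的 WordPiece 分词器观察子词切分(需安装 transformers)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["playing", "unaffordable", "tokenization"]:
    print(word, "->", tokenizer.tokenize(word))
# 未登录词会被拆成带 "##" 前缀的子词片段,例如 "unaffordable" 可能被切分为
# ['una', '##ffo', '##rda', '##ble'] 之类的组合(具体取决于词表)
```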
```python
# BPE算法实现伪代码
class BPETokenizer:
def __init__(self, vocab_size=30000):
self.vocab_size = vocab_size
self.word_freq = {}
self.vocab = set()
self.merges = []
def train(self, corpus):
# 1. 统计词频,初始化为字符级别
for sentence in corpus:
words = sentence.split()
for word in words:
word_chars = ' '.join(list(word)) + ' </w>'
self.word_freq[word_chars] = self.word_freq.get(word_chars, 0) + 1
# 2. 建立初始词汇表(所有字符)
for word in self.word_freq:
for char in word.split():
self.vocab.add(char)
# 3. 迭代合并最频繁的字符对
while len(self.vocab) < self.vocab_size:
pairs = self.get_pairs()
if not pairs:
break
best_pair = max(pairs, key=pairs.get)
self.vocab.add(''.join(best_pair))
self.merges.append(best_pair)
# 更新词频统计
self.merge_vocab(best_pair)
def get_pairs(self):
pairs = defaultdict(int)
for word, freq in self.word_freq.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i + 1])] += freq
return pairs
def merge_vocab(self, pair):
new_word_freq = {}
bigram = ' '.join(pair)
replacement = ''.join(pair)
for word in self.word_freq:
new_word = word.replace(bigram, replacement)
new_word_freq[new_word] = self.word_freq[word]
self.word_freq = new_word_freq
def encode(self, text):
tokens = []
for word in text.split():
word_tokens = self.bpe_encode(word)
tokens.extend(word_tokens)
return tokens
def bpe_encode(self, word):
word = ' '.join(list(word)) + ' </w>'
# 应用学习到的合并规则
for pair in self.merges:
if ' '.join(pair) in word:
word = word.replace(' '.join(pair), ''.join(pair))
return word.split()
# 使用示例
tokenizer = BPETokenizer(vocab_size=1000)
corpus = ["unhappiness is common", "happiness brings joy", ...]
tokenizer.train(corpus)
# 编码新文本
tokens = tokenizer.encode("unhappiness")
print(f"BPE tokens: {tokens}")
# 可能输出: ['un', 'happi', 'ness', '</w>']
```
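以上是手写 BPE 的教学示意。实际工程中通常直接使用 Hugging Face 的 tokenizers 库训练 BPE 分词器,下面是一个最小示意(其中 corpus.txt 为假设的语料文件路径):

```python
# 用 tokenizers 库训练 BPE 分词器的最小示意
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt 为假设的训练语料

output = tokenizer.encode("unhappiness brings nothing")
print(output.tokens)
```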
### 1.3.3 词性标注
词性标注(Part-of-Speech Tagging, POS Tagging)是为文本中每个词汇分配语法范畴标签的基础任务。准确的词性信息对句法分析、语义理解、信息抽取等下游任务具有重要价值。
技术发展脉络:
- 基于规则的方法:利用人工编写的语法规则进行标注
- 基于统计的方法:HMM、最大熵模型等概率图模型
- 基于神经网络的方法:BiLSTM、Transformer等深度架构
评估指标:
- 准确率(Accuracy):正确标注的词汇比例
- 未知词准确率:模型对训练集中未出现词汇的标注性能
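这两个指标的计算都非常直接,下面是一个最小示意(其中的 words、gold、pred 和 train_vocab 均为假设数据):

```python
# 词性标注评估指标的最小计算示意
def pos_accuracy(gold_tags, pred_tags, words, train_vocab):
    total = correct = unk_total = unk_correct = 0
    for word, g, p in zip(words, gold_tags, pred_tags):
        total += 1
        correct += int(g == p)
        if word not in train_vocab:          # 训练集中未出现的词
            unk_total += 1
            unk_correct += int(g == p)
    acc = correct / total
    unk_acc = unk_correct / unk_total if unk_total else float("nan")
    return acc, unk_acc

words = ["She", "is", "playing", "the", "guqin"]
gold  = ["PRP", "VBZ", "VBG", "DT", "NN"]
pred  = ["PRP", "VBZ", "VBG", "DT", "JJ"]
print(pos_accuracy(gold, pred, words, train_vocab={"She", "is", "playing", "the"}))
# (0.8, 0.0):总体准确率 0.8,未知词("guqin")准确率 0.0
```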
```python
# 基于BiLSTM的词性标注器伪代码
class POSTagger:
def __init__(self, vocab_size, tag_size, embedding_dim, hidden_dim):
self.word_embedding = nn.Embedding(vocab_size, embedding_dim)
self.char_embedding = nn.Embedding(256, 50) # 字符级特征
# 字符级BiLSTM
self.char_lstm = nn.LSTM(50, 25, bidirectional=True, batch_first=True)
# 词级BiLSTM
self.word_lstm = nn.LSTM(embedding_dim + 50, hidden_dim,
bidirectional=True, batch_first=True)
self.hidden2tag = nn.Linear(hidden_dim * 2, tag_size)
self.dropout = nn.Dropout(0.5)
def get_char_features(self, words):
char_features = []
for word in words:
chars = [ord(c) for c in word[:20]] # 截断长词
char_embeds = self.char_embedding(torch.LongTensor(chars))
char_lstm_out, (h, c) = self.char_lstm(char_embeds.unsqueeze(0))
# 使用最后时刻的隐状态作为字符特征
word_char_feature = torch.cat([h[0], h[1]], dim=1)
char_features.append(word_char_feature)
return torch.cat(char_features, dim=0)
def forward(self, sentence, sentence_chars):
# 词级嵌入
word_embeds = self.word_embedding(sentence)
# 字符级特征
char_features = self.get_char_features(sentence_chars)
# 拼接词嵌入和字符特征
combined_embeds = torch.cat([word_embeds, char_features.unsqueeze(0)], dim=2)
# BiLSTM编码
lstm_out, _ = self.word_lstm(combined_embeds)
lstm_out = self.dropout(lstm_out)
# 标签预测
tag_logits = self.hidden2tag(lstm_out)
return tag_logits
def predict(self, sentence, sentence_chars):
tag_logits = self.forward(sentence, sentence_chars)
predicted_tags = torch.argmax(tag_logits, dim=2)
return predicted_tags
# 标注示例
sentence = "She is playing the guitar"
pos_tags = ["PRP", "VBZ", "VBG", "DT", "NN"]
print(f"句子: {sentence}")
print(f"词性标注: {list(zip(sentence.split(), pos_tags))}")
# 输出: [('She', 'PRP'), ('is', 'VBZ'), ('playing', 'VBG'),
# ('the', 'DT'), ('guitar', 'NN')]
```

### 1.3.4 文本分类
文本分类(Text Classification)是NLP领域的经典任务,旨在根据文本内容将其归类到预定义的类别中。这项技术在垃圾邮件过滤、情感分析、新闻分类、内容审核等场景中有着广泛应用。
任务类型:
- 二分类:如垃圾邮件检测、情感极性分析
- 多分类:如新闻主题分类、产品类别识别
- 多标签分类:如文档标签标注、电影类型分类
- 层次分类:如学科分类、商品类目分类
技术演进:
- 传统方法:朴素贝叶斯、SVM + TF-IDF特征
- 深度学习方法:CNN、RNN、BERT等预训练模型
- 少样本学习:原型网络、元学习、提示学习
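作为传统方法的参考,下面用 scikit-learn 给出 TF-IDF + 线性 SVM 的一个最小示意(训练数据为几条假设样本,中文场景这里用字符 n-gram 代替分词特征):

```python
# 传统文本分类示意:TF-IDF + 线性SVM(scikit-learn)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["新款手机芯片性能大幅提升", "央行宣布调整利率政策", "球队在总决赛中获胜"]
train_labels = ["科技", "财经", "体育"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 2))),  # 中文用字符n-gram特征
    ("svm", LinearSVC()),
])
clf.fit(train_texts, train_labels)
print(clf.predict(["芯片行业迎来新一轮增长"]))  # 期望输出: ['科技']
```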
基于BERT的文本分类器伪代码
```python
# 基于BERT的文本分类器伪代码
class BERTClassifier:
def __init__(self, bert_model_name, num_classes, max_length=512):
self.bert = BertModel.from_pretrained(bert_model_name)
self.dropout = nn.Dropout(0.1)
self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)
self.max_length = max_length
def forward(self, input_ids, attention_mask, token_type_ids=None):
# BERT编码
outputs = self.bert(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)
# 使用[CLS]标记的表示进行分类
pooled_output = outputs.pooler_output
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return logits
def predict(self, texts, tokenizer):
predictions = []
for text in texts:
# 文本编码
encoded = tokenizer.encode_plus(
text,
add_special_tokens=True,
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
# 模型推理
with torch.no_grad():
logits = self.forward(
input_ids=encoded['input_ids'],
attention_mask=encoded['attention_mask']
)
predicted_class = torch.argmax(logits, dim=1).item()
confidence = torch.softmax(logits, dim=1).max().item()
predictions.append({
'class': predicted_class,
'confidence': confidence
})
return predictions
```

多标签文本分类示例
```python
# 多标签文本分类示例
class MultiLabelTextClassifier:
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels):
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.bilstm = nn.LSTM(embedding_dim, hidden_dim,
bidirectional=True, batch_first=True)
self.attention = nn.MultiheadAttention(hidden_dim * 2, num_heads=8)
self.classifier = nn.Linear(hidden_dim * 2, num_labels)
def forward(self, input_ids):
# 嵌入层
embeddings = self.embedding(input_ids)
# BiLSTM编码
lstm_out, _ = self.bilstm(embeddings)
# 注意力机制
attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out)
# 全局平均池化
pooled = torch.mean(attn_out, dim=1)
# 多标签预测
logits = self.classifier(pooled)
return torch.sigmoid(logits) # 多标签使用sigmoid激活
```

使用示例
```python
# 使用示例
texts = [
"苹果公司发布了新款iPhone,搭载强大的A17芯片",
"美国总统宣布新的经济政策,股市应声上涨",
"NBA总决赛即将开打,湖人队备战充分"
]
class_labels = ["科技", "政治", "体育"]
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
classifier = BERTClassifier('bert-base-chinese', num_classes=3)
predictions = classifier.predict(texts, tokenizer)
for text, pred in zip(texts, predictions):
    predicted_label = class_labels[pred['class']]
    print(f"{text} -> {predicted_label} (置信度: {pred['confidence']:.3f})")
```

### 1.3.5 实体识别
命名实体识别(Named Entity Recognition, NER)是信息抽取领域的核心任务,旨在从非结构化文本中识别并分类具有特定语义的实体,如人名、地名、机构名、时间、金额等。NER是构建知识图谱、问答系统、信息检索系统的重要基础。
实体类型体系:
- 通用实体:人名(PER)、地名(LOC)、机构名(ORG)、其他(MISC)
- 领域实体:蛋白质、基因、疾病(生物医学);产品、品牌(电商)
- 细粒度实体:职业、国籍、货币、百分比等数十种类型
技术方案演进:
- 基于规则的方法:正则表达式、词典匹配、规则模板
- 基于机器学习:CRF、SVM等特征工程方法
- 基于深度学习:BiLSTM-CRF、BERT-CRF等端到端模型
- 少样本/零样本NER:原型学习、提示学习、生成式方法
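其中基于规则与词典的方法实现门槛最低,下面给出一个仅作演示思路的最小示意(正则与词典均为假设示例):

```python
# 基于规则与词典的简易NER示意
import re

ORG_DICT = {"微软公司", "谷歌"}
DATE_PATTERN = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")

def rule_based_ner(text):
    entities = []
    for m in DATE_PATTERN.finditer(text):                 # 规则:日期
        entities.append((m.group(), "DATE", m.start(), m.end()))
    for org in ORG_DICT:                                  # 词典:机构名
        idx = text.find(org)
        if idx != -1:
            entities.append((org, "ORG", idx, idx + len(org)))
    return entities

print(rule_based_ner("微软公司于2024年4月7日发布了新产品"))
# [('2024年4月7日', 'DATE', 5, 14), ('微软公司', 'ORG', 0, 4)]
```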
基于BERT-CRF的命名实体识别系统伪代码
```python
# 基于BERT-CRF的命名实体识别系统伪代码
class BERTCRFForNER:
def __init__(self, bert_model_name, num_labels):
self.bert = BertModel.from_pretrained(bert_model_name)
self.dropout = nn.Dropout(0.1)
self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
self.crf = CRF(num_labels, batch_first=True)
# BIO标注体系
self.label2id = {
'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-LOC': 3, 'I-LOC': 4,
'B-ORG': 5, 'I-ORG': 6, 'B-MISC': 7, 'I-MISC': 8
}
self.id2label = {v: k for k, v in self.label2id.items()}
def forward(self, input_ids, attention_mask, labels=None):
# BERT编码
bert_output = self.bert(input_ids=input_ids,
attention_mask=attention_mask)
sequence_output = bert_output.last_hidden_state
sequence_output = self.dropout(sequence_output)
# 线性分类层
emissions = self.classifier(sequence_output)
if labels is not None:
# 训练阶段:计算CRF损失
loss = -self.crf(emissions, labels, mask=attention_mask.byte())
return loss
else:
# 预测阶段:CRF解码
predictions = self.crf.decode(emissions, mask=attention_mask.byte())
return predictions
def extract_entities(self, text, tokenizer):
# 文本编码
encoded = tokenizer.encode_plus(
text, return_tensors='pt',
padding=True, truncation=True, max_length=512
)
# 模型预测
with torch.no_grad():
predictions = self.forward(
input_ids=encoded['input_ids'],
attention_mask=encoded['attention_mask']
)[0]
# 解码为实体
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
entities = self.decode_entities(tokens, predictions)
return entities
def decode_entities(self, tokens, labels):
entities = []
current_entity = {"text": "", "label": "", "start": -1, "end": -1}
for i, (token, label) in enumerate(zip(tokens, labels)):
label_name = self.id2label[label]
if label_name.startswith('B-'):
# 开始新实体
if current_entity["text"]:
entities.append(current_entity.copy())
current_entity = {
"text": token,
"label": label_name[2:], # 去掉B-前缀
"start": i,
"end": i
}
elif label_name.startswith('I-') and current_entity["label"] == label_name[2:]:
# 继续当前实体
current_entity["text"] += token
current_entity["end"] = i
else:
# 实体结束
if current_entity["text"]:
entities.append(current_entity.copy())
current_entity = {"text": "", "label": "", "start": -1, "end": -1}
# 处理最后一个实体
if current_entity["text"]:
entities.append(current_entity)
return entities
```

嵌套命名实体识别
```python
# 嵌套命名实体识别
class NestedNERModel:
def __init__(self, bert_model_name, entity_types):
self.bert = BertModel.from_pretrained(bert_model_name)
self.entity_types = entity_types
# 为每种实体类型训练一个二分类器
self.classifiers = nn.ModuleDict({
entity_type: nn.Linear(self.bert.config.hidden_size, 2)
for entity_type in entity_types
})
def forward(self, input_ids, attention_mask):
bert_output = self.bert(input_ids, attention_mask)
sequence_output = bert_output.last_hidden_state
# 为每种实体类型预测
predictions = {}
for entity_type in self.entity_types:
logits = self.classifiers[entity_type](sequence_output)
predictions[entity_type] = torch.softmax(logits, dim=-1)
return predictions
```

使用示例
```python
# 使用示例
text = "李雷和韩梅梅是北京市海淀区的居民,他们计划在2024年4月7日去上海旅行"
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
ner_model = BERTCRFForNER('bert-base-chinese', num_labels=9)
entities = ner_model.extract_entities(text, tokenizer)
print(f"输入文本: {text}")
print("识别的实体:")
for entity in entities:
print(f" {entity['text']} -> {entity['label']}")
```

### 1.3.6 关系抽取
关系抽取(Relation Extraction, RE)是信息抽取的核心任务之一,旨在从文本中识别实体对之间的语义关系。这项技术是构建知识图谱、智能问答、信息检索系统的关键组件,能够将非结构化文本转换为结构化的三元组知识。
关系类型分类:
- 语义关系:上下位关系、同义关系、反义关系
- 事实关系:出生地、就职于、位于、发生于
- 因果关系:导致、引起、影响、促进
- 时序关系:之前、之后、同时、期间
技术挑战:
- 关系的多样性和复杂性:自然语言表达同一关系的方式多样
- 远程监督的噪声问题:自动标注数据存在标签噪声
- 少样本关系识别:新领域关系类型的快速适应
- 关系方向判断:正确识别关系的主语和宾语
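后文示例会用 [E1]/[/E1]、[E2]/[/E2] 标记实体位置。若使用 transformers,这类标记通常需要先注册为特殊 token 并扩展词表,下面是一个用法示意(并非某篇论文的标准做法,仅供参考):

```python
# 将实体标记注册为特殊token并扩展嵌入矩阵(transformers用法示意)
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

special_tokens = {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
tokenizer.add_special_tokens(special_tokens)
model.resize_token_embeddings(len(tokenizer))  # 扩展嵌入矩阵以容纳新增token

marked_text = "[E1]比尔·盖茨[/E1]是[E2]微软公司[/E2]的创始人"
print(tokenizer.tokenize(marked_text))
```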
基于BERT的关系分类器伪代码
```python
# 基于BERT的关系分类器伪代码
class BERTRelationClassifier:
def __init__(self, bert_model_name, num_relations):
self.bert = BertModel.from_pretrained(bert_model_name)
self.dropout = nn.Dropout(0.1)
self.relation_classifier = nn.Linear(self.bert.config.hidden_size, num_relations)
# 实体位置嵌入
self.entity_embedding = nn.Embedding(4, 50) # 实体位置标记嵌入:1表示头实体span,3表示尾实体span
def forward(self, input_ids, attention_mask, entity_positions):
# BERT编码
bert_output = self.bert(input_ids, attention_mask)
sequence_output = bert_output.last_hidden_state
# 实体位置信息融入
entity_embeds = self.entity_embedding(entity_positions)
enhanced_output = sequence_output + entity_embeds
# 提取实体对表示
entity_repr = self.extract_entity_representation(enhanced_output, entity_positions)
# 关系分类
relation_logits = self.relation_classifier(entity_repr)
return relation_logits
def extract_entity_representation(self, sequence_output, entity_positions):
batch_size, seq_len, hidden_size = sequence_output.size()
# 提取两个实体的平均表示
entity1_mask = (entity_positions == 1).float().unsqueeze(-1) # [E1]标记
entity2_mask = (entity_positions == 3).float().unsqueeze(-1) # [E2]标记
entity1_repr = torch.sum(sequence_output * entity1_mask, dim=1) / \
torch.sum(entity1_mask, dim=1)
entity2_repr = torch.sum(sequence_output * entity2_mask, dim=1) / \
torch.sum(entity2_mask, dim=1)
# 拼接实体表示
entity_pair_repr = torch.cat([entity1_repr, entity2_repr], dim=1)
return entity_pair_repr
```

基于图神经网络的关系抽取
```python
# 基于图神经网络的关系抽取
class GCNRelationExtractor:
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_relations):
self.word_embedding = nn.Embedding(vocab_size, embedding_dim)
self.pos_embedding = nn.Embedding(50, 50) # 位置嵌入
# 图卷积网络层
self.gcn_layers = nn.ModuleList([
GCNLayer(embedding_dim + 50, hidden_dim),
GCNLayer(hidden_dim, hidden_dim)
])
self.classifier = nn.Linear(hidden_dim * 2, num_relations)
def forward(self, words, pos_tags, dependency_adj, head_idx, tail_idx):
# 词嵌入和位置嵌入
word_embeds = self.word_embedding(words)
pos_embeds = self.pos_embedding(pos_tags)
# 拼接特征
node_features = torch.cat([word_embeds, pos_embeds], dim=-1)
# 图卷积编码
for gcn_layer in self.gcn_layers:
node_features = gcn_layer(node_features, dependency_adj)
node_features = F.relu(node_features)
# 提取头尾实体表示
head_repr = node_features[head_idx]
tail_repr = node_features[tail_idx]
# 关系预测
relation_repr = torch.cat([head_repr, tail_repr], dim=-1)
relation_logits = self.classifier(relation_repr)
return relation_logits
```

联合实体关系抽取
```python
# 联合实体关系抽取
class JointEntityRelationExtractor:
def __init__(self, bert_model_name, entity_labels, relation_labels):
self.bert = BertModel.from_pretrained(bert_model_name)
# 实体识别头
self.entity_classifier = nn.Linear(
self.bert.config.hidden_size, len(entity_labels))
# 关系识别头(基于实体对)
self.relation_classifier = nn.Linear(
self.bert.config.hidden_size * 2, len(relation_labels))
def forward(self, input_ids, attention_mask):
bert_output = self.bert(input_ids, attention_mask)
sequence_output = bert_output.last_hidden_state
# 实体预测
entity_logits = self.entity_classifier(sequence_output)
# 生成所有可能的实体对
entity_pairs = self.generate_entity_pairs(sequence_output, entity_logits)
# 关系预测
relation_predictions = []
for head_repr, tail_repr in entity_pairs:
pair_repr = torch.cat([head_repr, tail_repr], dim=-1)
relation_logits = self.relation_classifier(pair_repr)
relation_predictions.append(relation_logits)
return entity_logits, relation_predictions
```

使用示例
```python
# 使用示例
text = "比尔·盖茨是微软公司的创始人"
# 标记实体位置
marked_text = "[E1]比尔·盖茨[/E1]是[E2]微软公司[/E2]的创始人"
relation_extractor = BERTRelationClassifier('bert-base-chinese', num_relations=50)
# 关系类型
relations = {
0: "创始人",
1: "CEO",
2: "员工",
3: "投资者",
# ... 更多关系类型
}
predicted_relation = relation_extractor.predict(marked_text)
```

### 1.3.7 文本摘要
文本摘要(Text Summarization)是自动生成简洁、准确、流畅摘要的NLP任务,旨在保留原文的关键信息和主要观点。随着信息爆炸时代的到来,文本摘要技术在新闻聚合、学术文献处理、报告生成等场景中发挥着越来越重要的作用。
技术分类:
- 抽取式摘要(Extractive Summarization)
  - 直接选择原文中的重要句子组成摘要
  - 优点:语法正确性好,事实准确性高
  - 缺点:连贯性可能较差,表达不够灵活
- 生成式摘要(Abstractive Summarization)
  - 理解原文语义后重新组织语言生成摘要
  - 优点:表达灵活,连贯性好
  - 缺点:可能产生事实错误,计算复杂度高
评估指标:
- ROUGE:基于n-gram重叠的自动评估指标
- BLEU:主要用于机器翻译,也可用于摘要评估
- BERTScore:基于语义相似度的评估方法
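以 ROUGE 为例,可以用 rouge-score 库直接计算,下面是一个最小示意(示例句子为英文;中文文本一般需要先分词再计算):

```python
# 用 rouge-score 库计算 ROUGE 指标的最小示意
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "ai technology drives digital transformation across industries"
prediction = "ai drives digital transformation in many industries"
scores = scorer.score(reference, prediction)
for name, score in scores.items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```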
基于Transformer的抽取式摘要系统伪代码
```python
# 基于Transformer的抽取式摘要系统伪代码
class ExtractiveSummarizer:
def __init__(self, bert_model_name, max_seq_length=512):
self.bert = BertModel.from_pretrained(bert_model_name)
self.sentence_classifier = nn.Linear(self.bert.config.hidden_size, 1)
self.max_seq_length = max_seq_length
def forward(self, input_ids, attention_mask, sentence_positions):
# BERT编码整篇文档
bert_output = self.bert(input_ids, attention_mask)
sequence_output = bert_output.last_hidden_state
# 提取每个句子的表示
sentence_representations = self.extract_sentence_representations(
sequence_output, sentence_positions)
# 预测每个句子的重要性分数
importance_scores = self.sentence_classifier(sentence_representations)
return torch.sigmoid(importance_scores) # 转换为0-1概率
def extract_sentence_representations(self, sequence_output, sentence_positions):
sentence_reprs = []
for start, end in sentence_positions:
# 使用句子中所有token的平均表示
sentence_repr = torch.mean(sequence_output[start:end], dim=0)
sentence_reprs.append(sentence_repr)
return torch.stack(sentence_reprs)
def summarize(self, document, top_k=3):
sentences = self.split_sentences(document)
# 编码文档
encoded = self.encode_document(document)
# 预测句子重要性
with torch.no_grad():
importance_scores = self.forward(
encoded['input_ids'],
encoded['attention_mask'],
encoded['sentence_positions']
)
# 选择top-k重要句子
top_indices = torch.topk(importance_scores.squeeze(), top_k).indices
summary_sentences = [sentences[i] for i in sorted(top_indices)]
return ' '.join(summary_sentences)
```

多文档摘要系统
```python
# 多文档摘要系统
class MultiDocumentSummarizer:
def __init__(self, bert_model_name):
self.sentence_encoder = SentenceBERT(bert_model_name)
self.cluster_algorithm = KMeans(n_clusters=5)
def summarize_multiple_documents(self, documents, summary_length=200):
# 1. 提取所有句子
all_sentences = []
doc_ids = []
for doc_id, doc in enumerate(documents):
sentences = self.split_sentences(doc)
all_sentences.extend(sentences)
doc_ids.extend([doc_id] * len(sentences))
# 2. 句子编码
sentence_embeddings = self.sentence_encoder.encode(all_sentences)
# 3. 句子聚类
clusters = self.cluster_algorithm.fit_predict(sentence_embeddings)
# 4. 从每个聚类中选择代表句子
summary_sentences = []
for cluster_id in set(clusters):
cluster_sentences = [sent for i, sent in enumerate(all_sentences)
if clusters[i] == cluster_id]
# 选择与聚类中心最近的句子作为代表
cluster_center = np.mean([sentence_embeddings[i] for i, c in enumerate(clusters)
if c == cluster_id], axis=0)
best_sentence = self.find_closest_sentence(cluster_sentences, cluster_center)
summary_sentences.append(best_sentence)
# 5. 按原文顺序重新排列并截断
summary = ' '.join(summary_sentences[:summary_length])
return summary
```

基于Seq2Seq的生成式摘要系统
```python
# 基于Seq2Seq的生成式摘要系统
class AbstractiveSummarizer:
def __init__(self, model_name='t5-base', max_input_length=1024, max_output_length=128):
self.tokenizer = T5Tokenizer.from_pretrained(model_name)
self.model = T5ForConditionalGeneration.from_pretrained(model_name)
self.max_input_length = max_input_length
self.max_output_length = max_output_length
def summarize(self, text, num_beams=4, length_penalty=2.0):
# 添加任务前缀
input_text = f"summarize: {text}"
# 编码输入
input_ids = self.tokenizer.encode(
input_text,
return_tensors='pt',
max_length=self.max_input_length,
truncation=True,
padding=True
)
# 生成摘要
with torch.no_grad():
summary_ids = self.model.generate(
input_ids,
num_beams=num_beams,
max_length=self.max_output_length,
length_penalty=length_penalty,
early_stopping=True
)
# 解码摘要
summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
return summary
```

层次化摘要系统
```python
# 层次化摘要系统
class HierarchicalSummarizer:
def __init__(self):
self.sentence_summarizer = ExtractiveSummarizer()
self.paragraph_summarizer = AbstractiveSummarizer()
def hierarchical_summarize(self, long_document):
# 第一层:段落级摘要
paragraphs = self.split_paragraphs(long_document)
paragraph_summaries = []
for paragraph in paragraphs:
if len(paragraph) > 200: # 长段落需要摘要
para_summary = self.sentence_summarizer.summarize(paragraph, top_k=2)
paragraph_summaries.append(para_summary)
else:
paragraph_summaries.append(paragraph)
# 第二层:文档级摘要
combined_text = ' '.join(paragraph_summaries)
final_summary = self.paragraph_summarizer.summarize(combined_text)
return final_summary
```

使用示例
```python
# 使用示例
document = """
2024年第三季度,全球科技行业继续保持强劲增长态势。人工智能技术的快速发展
推动了各个行业的数字化转型进程。大型科技公司纷纷加大在AI领域的投资力度,
推出了众多创新产品和服务。
其中,自然语言处理技术取得了重大突破。新一代大语言模型在理解和生成能力上
达到了前所未有的水平,为智能对话、内容创作、代码生成等应用场景提供了强大
的技术支撑。同时,模型的效率也在不断提升,使得大规模应用成为可能。
展望未来,人工智能技术将继续深度融入各行各业,为社会发展注入新的活力。
预计到2025年,AI相关产业的市场规模将突破万亿美元,成为全球经济增长的
重要引擎。
"""
# 抽取式摘要
extractive_summarizer = ExtractiveSummarizer()
extractive_summary = extractive_summarizer.summarize(document, top_k=2)
print(f"抽取式摘要: {extractive_summary}")
# 生成式摘要
abstractive_summarizer = AbstractiveSummarizer()
abstractive_summary = abstractive_summarizer.summarize(document)
print(f"生成式摘要: {abstractive_summary}")
```

### 1.3.8 机器翻译
机器翻译(Machine Translation, MT)是NLP领域最具挑战性和应用价值的任务之一,旨在实现不同自然语言之间的自动转换。现代机器翻译系统不仅要处理词汇的对应关系,更要准确传达源语言的语义、语用信息和文化内涵,实现真正的跨语言交流。
技术发展脉络:
- 基于规则的机器翻译(RBMT):依赖专家编写的语法规则和词典
- 统计机器翻译(SMT):基于平行语料的统计学习方法
- 神经机器翻译(NMT):端到端的神经网络架构
- 大模型时代的翻译:多语言预训练模型的zero-shot翻译
核心技术挑战:
- 语序差异:不同语言的句法结构差异巨大
- 一词多义:相同词汇在不同语境下的翻译选择
- 文化适应:习语、俚语、文化特定概念的翻译
- 长距离依赖:长句中跨越多个短语的语义依赖
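在动手实现之前,也可以直接调用开源的预训练翻译模型。下面是基于 transformers pipeline 的一个最小示意(模型名 Helsinki-NLP/opus-mt-zh-en 为常见的开源中英翻译模型,仅作示例):

```python
# 直接调用预训练翻译模型的最小示意(需安装 transformers)
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
result = translator("今天天气很好", max_length=40)
print(result[0]["translation_text"])  # 预期得到类似 "The weather is very nice today" 的译文
```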
基于Transformer的神经机器翻译系统伪代码
```python
# 基于Transformer的神经机器翻译系统伪代码
class TransformerMT:
def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512,
num_heads=8, num_layers=6, d_ff=2048):
self.d_model = d_model
# 编码器和解码器嵌入层
self.src_embedding = nn.Embedding(src_vocab_size, d_model)
self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
# 位置编码
self.pos_encoding = PositionalEncoding(d_model)
# Transformer编码器
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model, nhead=num_heads, dim_feedforward=d_ff)
self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
# Transformer解码器
decoder_layer = nn.TransformerDecoderLayer(
d_model=d_model, nhead=num_heads, dim_feedforward=d_ff)
self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
# 输出投影层
self.output_projection = nn.Linear(d_model, tgt_vocab_size)
def forward(self, src_tokens, tgt_tokens=None):
# 源语言编码
src_embed = self.src_embedding(src_tokens) * math.sqrt(self.d_model)
src_embed = self.pos_encoding(src_embed)
# 编码器
memory = self.encoder(src_embed)
if tgt_tokens is not None:
# 训练模式:teacher forcing
tgt_embed = self.tgt_embedding(tgt_tokens) * math.sqrt(self.d_model)
tgt_embed = self.pos_encoding(tgt_embed)
# 创建目标序列的因果掩码
tgt_mask = self.generate_square_subsequent_mask(tgt_tokens.size(1))
# 解码器
decoder_output = self.decoder(tgt_embed, memory, tgt_mask=tgt_mask)
# 输出层
logits = self.output_projection(decoder_output)
return logits
else:
# 推理模式:自回归生成
return self.generate(memory)
def generate_square_subsequent_mask(self, size):
mask = torch.triu(torch.ones(size, size), diagonal=1)
return mask.bool()
def generate(self, memory, max_length=100, bos_token=1, eos_token=2):
batch_size = memory.size(0)
generated = torch.LongTensor([[bos_token]] * batch_size)
for _ in range(max_length):
tgt_embed = self.tgt_embedding(generated) * math.sqrt(self.d_model)
tgt_embed = self.pos_encoding(tgt_embed)
tgt_mask = self.generate_square_subsequent_mask(generated.size(1))
decoder_output = self.decoder(tgt_embed, memory, tgt_mask=tgt_mask)
# 预测下一个词
next_token_logits = self.output_projection(decoder_output[:, -1, :])
next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
generated = torch.cat([generated, next_token], dim=1)
# 检查是否所有序列都结束
if torch.all(next_token.squeeze() == eos_token):
break
return generated
```

多语言机器翻译系统
```python
# 多语言机器翻译系统
class MultilingualMT:
def __init__(self, languages, shared_vocab_size=50000):
self.languages = languages
self.num_languages = len(languages)
# 共享的多语言编码器
self.shared_encoder = TransformerEncoder(
vocab_size=shared_vocab_size,
d_model=512,
num_heads=8,
num_layers=12
)
# 语言特定的解码器
self.decoders = nn.ModuleDict({
lang: TransformerDecoder(
vocab_size=shared_vocab_size,
d_model=512,
num_heads=8,
num_layers=6
) for lang in languages
})
# 语言标识嵌入
self.lang_embedding = nn.Embedding(self.num_languages, 512)
def forward(self, src_tokens, src_lang, tgt_tokens, tgt_lang):
# 添加源语言标识
src_lang_id = self.get_lang_id(src_lang)
src_lang_embed = self.lang_embedding(src_lang_id)
# 共享编码器编码
encoder_output = self.shared_encoder(src_tokens, src_lang_embed)
# 目标语言特定解码器
tgt_decoder = self.decoders[tgt_lang]
# 添加目标语言标识
tgt_lang_id = self.get_lang_id(tgt_lang)
tgt_lang_embed = self.lang_embedding(tgt_lang_id)
# 解码
output = tgt_decoder(tgt_tokens, encoder_output, tgt_lang_embed)
return output
```

基于注意力的对齐可视化
```python
# 基于注意力的对齐可视化
class AttentionVisualizer:
def __init__(self, model):
self.model = model
self.attention_weights = {}
# 注册hook收集注意力权重
self.register_hooks()
def register_hooks(self):
def attention_hook(module, input, output):
if hasattr(module, 'attention_weights'):
layer_name = module.__class__.__name__
self.attention_weights[layer_name] = output[1] # attention weights
for layer in self.model.modules():
if isinstance(layer, nn.MultiheadAttention):
layer.register_forward_hook(attention_hook)
def visualize_alignment(self, src_sentence, tgt_sentence, src_tokens, tgt_tokens):
# 获取翻译过程中的注意力权重
with torch.no_grad():
_ = self.model(src_tokens, tgt_tokens)
# 提取编码器-解码器注意力权重
enc_dec_attention = self.attention_weights.get('MultiheadAttention', None)
if enc_dec_attention is not None:
# 绘制注意力热力图
self.plot_attention_heatmap(
src_sentence, tgt_sentence, enc_dec_attention)
def plot_attention_heatmap(self, src_words, tgt_words, attention_matrix):
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 8))
sns.heatmap(attention_matrix,
xticklabels=src_words,
yticklabels=tgt_words,
cmap='Blues')
plt.title('Translation Attention Alignment')
plt.xlabel('Source Sentence')
plt.ylabel('Target Sentence')
plt.show()
```

翻译质量评估
```python
# 翻译质量评估
class TranslationEvaluator:
def __init__(self):
self.bleu = BLEUScore()
self.meteor = METEORScore()
self.bertscore = BERTScore()
def evaluate_translation(self, predictions, references):
results = {}
# BLEU评分
results['bleu'] = self.bleu.compute(predictions, references)
# METEOR评分
results['meteor'] = self.meteor.compute(predictions, references)
# BERTScore评分
results['bertscore'] = self.bertscore.compute(predictions, references)
# 人工评估指标
results['adequacy'] = self.human_adequacy_score(predictions, references)
results['fluency'] = self.human_fluency_score(predictions)
return results
```
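上面的 BLEUScore、METEORScore 等封装类仅为示意。实际计算语料级 BLEU 可以直接使用 sacrebleu,例如:

```python
# 用 sacrebleu 计算语料级 BLEU 的最小示意
import sacrebleu

predictions = ["The weather is very nice today",
               "I like studying natural language processing"]
references = [["The weather is nice today",
               "I like to study natural language processing"]]  # 每个内层列表是一组完整参考译文
bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU = {bleu.score:.2f}")
```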
使用示例
```python
# 使用示例
# 中英翻译系统
zh_en_translator = TransformerMT(
src_vocab_size=50000, # 中文词汇表
tgt_vocab_size=30000, # 英文词汇表
d_model=512,
num_heads=8,
num_layers=6
)
# 翻译示例
chinese_text = "今天天气很好"
english_translation = zh_en_translator.translate(chinese_text)
print(f"中文原文: {chinese_text}")
print(f"英文翻译: {english_translation}")
# 预期输出: "The weather is very nice today"
# 多语言翻译示例
multilingual_translator = MultilingualMT(['zh', 'en', 'fr', 'de', 'ja'])
# 中文到法语的翻译
french_translation = multilingual_translator.translate(
text="我喜欢学习自然语言处理",
src_lang='zh',
tgt_lang='fr'
)
print(f"中文: 我喜欢学习自然语言处理")
print(f"法语: {french_translation}")
# 预期输出: "J'aime étudier le traitement du langage naturel"
```

### 1.3.9 自动问答
自动问答(Automatic Question Answering, QA)是NLP领域的高级任务,旨在构建能够理解自然语言问题并提供准确答案的智能系统。这项技术融合了信息检索、阅读理解、知识推理等多项NLP能力,是实现人机智能交互的关键技术。
问答系统分类:
- 事实型问答:回答具体事实性问题(如"北京的人口是多少?")
- 定义型问答:解释概念或术语(如"什么是深度学习?")
- 推理型问答:需要逻辑推理的复杂问题
- 对话型问答:支持多轮对话的交互式问答
技术架构类型:
- 检索式问答(Retrieval-based QA):从文档集合中检索相关段落并提取答案
- 生成式问答(Generative QA):直接生成自然语言答案
- 混合式问答(Hybrid QA):结合检索和生成的优势
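如果只需要抽取式问答的基本能力,也可以直接调用 transformers 的 question-answering pipeline,下面是一个最小示意(模型名仅作示例,可替换为任意抽取式问答模型):

```python
# 抽取式问答的最小调用示意(需安装 transformers)
from transformers import pipeline

qa = pipeline("question-answering", model="uer/roberta-base-chinese-extractive-qa")
result = qa(question="BERT是哪家公司提出的?",
            context="BERT是Google在2018年提出的预训练语言模型。")
print(result)  # 形如 {'score': ..., 'start': ..., 'end': ..., 'answer': 'Google'}
```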
基于检索增强生成的问答系统伪代码
```python
# 基于检索增强生成的问答系统伪代码
class RetrievalAugmentedQA:
def __init__(self, document_encoder, question_encoder, generator):
self.document_encoder = document_encoder # 文档编码器
self.question_encoder = question_encoder # 问题编码器
self.generator = generator # 答案生成器
self.document_index = None # 文档索引
def build_index(self, documents):
"""构建文档索引"""
document_embeddings = []
for doc in documents:
# 将文档分割成段落
passages = self.split_into_passages(doc)
for passage in passages:
# 编码段落
passage_embedding = self.document_encoder.encode(passage)
document_embeddings.append({
'text': passage,
'embedding': passage_embedding,
'doc_id': doc['id']
})
# 构建向量检索索引
self.document_index = FAISSIndex(document_embeddings)
def retrieve_relevant_passages(self, question, top_k=5):
"""检索相关段落"""
question_embedding = self.question_encoder.encode(question)
# 在索引中搜索最相似的段落
similar_passages = self.document_index.search(
question_embedding, top_k=top_k)
return similar_passages
def answer_question(self, question, top_k=5):
"""回答问题"""
# 1. 检索相关段落
relevant_passages = self.retrieve_relevant_passages(question, top_k)
# 2. 构建生成的输入
context = self.combine_passages(relevant_passages)
generation_input = f"Question: {question}\nContext: {context}\nAnswer:"
# 3. 生成答案
answer = self.generator.generate(generation_input)
# 4. 后处理和验证
verified_answer = self.verify_answer(answer, context)
return {
'answer': verified_answer,
'confidence': self.calculate_confidence(answer, context),
'sources': [p['doc_id'] for p in relevant_passages]
}
```

基于BERT的阅读理解问答系统
```python
# 基于BERT的阅读理解问答系统
class BERTReadingComprehension:
def __init__(self, bert_model_name):
self.bert = BertForQuestionAnswering.from_pretrained(bert_model_name)
self.tokenizer = BertTokenizer.from_pretrained(bert_model_name)
self.max_length = 512
def answer_question(self, question, context):
# 编码问题和上下文
encoded = self.tokenizer.encode_plus(
question, context,
add_special_tokens=True,
max_length=self.max_length,
padding='max_length',
truncation='only_second', # 只截断context
return_tensors='pt'
)
input_ids = encoded['input_ids']
attention_mask = encoded['attention_mask']
token_type_ids = encoded['token_type_ids']
# 模型预测
with torch.no_grad():
outputs = self.bert(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
# 找到最佳答案span
start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits)
# 确保end >= start
if end_idx < start_idx:
end_idx = start_idx
# 解码答案
answer_tokens = input_ids[0][start_idx:end_idx+1]
answer = self.tokenizer.decode(answer_tokens, skip_special_tokens=True)
# 计算置信度
start_confidence = torch.softmax(start_logits, dim=1).max().item()
end_confidence = torch.softmax(end_logits, dim=1).max().item()
confidence = (start_confidence + end_confidence) / 2
return {
'answer': answer,
'confidence': confidence,
'start_position': start_idx.item(),
'end_position': end_idx.item()
}
```

多跳推理问答系统
```python
# 多跳推理问答系统
class MultiHopQA:
def __init__(self, knowledge_graph, reasoning_model):
self.knowledge_graph = knowledge_graph
self.reasoning_model = reasoning_model
self.entity_linker = EntityLinker()
def answer_complex_question(self, question):
# 1. 问题分解
sub_questions = self.decompose_question(question)
# 2. 实体链接
entities = self.entity_linker.link_entities(question)
# 3. 多跳推理
reasoning_path = []
current_entities = entities
for sub_question in sub_questions:
# 在知识图谱中寻找相关路径
related_facts = self.knowledge_graph.find_related_facts(
current_entities, sub_question)
reasoning_path.append({
'sub_question': sub_question,
'facts': related_facts,
'entities': current_entities
})
# 更新当前实体集合
current_entities = self.extract_new_entities(related_facts)
# 4. 综合推理得出最终答案
final_answer = self.reasoning_model.synthesize_answer(
question, reasoning_path)
return {
'answer': final_answer,
'reasoning_path': reasoning_path,
'confidence': self.calculate_reasoning_confidence(reasoning_path)
}
```

对话式问答系统
```python
# 对话式问答系统
class ConversationalQA:
def __init__(self, qa_model, context_manager):
self.qa_model = qa_model
self.context_manager = context_manager
self.conversation_history = []
def answer_in_context(self, question, session_id):
# 1. 获取对话上下文
conversation_context = self.context_manager.get_context(session_id)
# 2. 问题理解和指代消解
resolved_question = self.resolve_references(
question, conversation_context)
# 3. 意图识别
intent = self.classify_intent(resolved_question)
if intent == 'follow_up':
# 基于上下文的追问
enhanced_question = self.enhance_with_context(
resolved_question, conversation_context)
elif intent == 'clarification':
# 澄清型问题
return self.handle_clarification(resolved_question, conversation_context)
else:
# 新的独立问题
enhanced_question = resolved_question
# 4. 回答问题
answer = self.qa_model.answer_question(enhanced_question)
# 5. 更新对话上下文
self.context_manager.update_context(
session_id, question, answer, resolved_question)
return answer
```

问答系统评估框架
```python
# 问答系统评估框架
class QAEvaluator:
def __init__(self):
self.exact_match = ExactMatch()
self.f1_score = F1Score()
self.bleu_score = BLEUScore()
self.human_evaluator = HumanEvaluator()
def comprehensive_evaluation(self, qa_system, test_dataset):
results = {
'exact_match': 0,
'f1_score': 0,
'bleu_score': 0,
'answer_coverage': 0,
'response_time': [],
'human_ratings': {}
}
predictions = []
references = []
for sample in test_dataset:
question = sample['question']
ground_truth = sample['answer']
context = sample.get('context', '')
# 测量响应时间
start_time = time.time()
prediction = qa_system.answer_question(question, context)
response_time = time.time() - start_time
predictions.append(prediction['answer'])
references.append(ground_truth)
results['response_time'].append(response_time)
# 自动评估指标
results['exact_match'] = self.exact_match.compute(predictions, references)
results['f1_score'] = self.f1_score.compute(predictions, references)
results['bleu_score'] = self.bleu_score.compute(predictions, references)
# 人工评估
results['human_ratings'] = self.human_evaluator.evaluate(
predictions, references, test_dataset)
return results
# 使用示例
# 构建检索增强问答系统
documents = [
{"id": "doc1", "text": "BERT是Google在2018年发布的预训练语言模型..."},
{"id": "doc2", "text": "Transformer架构使用了自注意力机制..."},
# 更多文档...
]
rag_qa = RetrievalAugmentedQA(
document_encoder=SentenceBERT('sentence-transformers/all-MiniLM-L6-v2'),
question_encoder=SentenceBERT('sentence-transformers/all-MiniLM-L6-v2'),
generator=T5Generator('t5-base')
)
rag_qa.build_index(documents)
```

问答示例
```python
# 问答示例
question = "BERT模型是什么时候发布的?"
answer = rag_qa.answer_question(question)
print(f"问题: {question}")
print(f"答案: {answer['answer']}")
print(f"置信度: {answer['confidence']:.3f}")
print(f"来源: {answer['sources']}")
# 阅读理解问答
reading_qa = BERTReadingComprehension('bert-base-chinese')
context = "BERT(Bidirectional Encoder Representations from Transformers)是Google在2018年提出的预训练语言模型。"
question = "BERT是哪家公司提出的?"
result = reading_qa.answer_question(question, context)
print(f"问题: {question}")
print(f"答案: {result['answer']}")
print(f"置信度: {result['confidence']:.3f}")