# NLP基础概念完整指南
## 总目录

- [第四章 大语言模型](#第四章-大语言模型)
- [第五章 动手搭建大模型](#第五章-动手搭建大模型)
- [第六章 大模型训练实践](#第六章-大模型训练实践)
- [第七章 大模型应用](#第七章-大模型应用)
## 章节目录

- [NLP 基础概念](#nlp-基础概念)
- [1.1 什么是 NLP](#11-什么是-nlp)
- [1.2 NLP 发展历程](#12-nlp-发展历程)
- [1.3 NLP 任务](#13-nlp-任务)
- [1.3.1 中文分词](#131-中文分词)
- [1.3.2 子词切分](#132-子词切分)
- [1.3.3 词性标注](#133-词性标注)
- [1.3.4 文本分类](#134-文本分类)
- [1.3.5 实体识别](#135-实体识别)
- [1.3.6 关系抽取](#136-关系抽取)
- [1.3.7 文本摘要](#137-文本摘要)
- [1.3.8 机器翻译](#138-机器翻译)
- [1.3.9 自动问答](#139-自动问答)
- [1.4 文本表示的发展历程](#14-文本表示的发展历程)
- [1.4.1 词向量](#141-词向量)
- [1.4.2 语言模型](#142-语言模型)
- [1.4.3 Word2Vec](#143-word2vec)
- [1.4.4 ELMo](#144-elmo)
## 1.3 NLP 任务
在NLP的技术体系中,各种具体任务构成了应用的基石。这些任务从文本的基础处理延伸到复杂的语义理解和内容生成,涵盖了语言处理的各个层面。每项任务都有其独特的技术挑战、评估指标和应用场景,它们相互关联、层层递进,共同构建了完整的NLP技术栈。
### 1.3.1 中文分词
中文分词(Chinese Word Segmentation, CWS)是中文自然语言处理的基础任务,其核心挑战在于中文文本没有天然的词边界标识符(如英文的空格),需要通过算法来识别和切分有意义的词汇单元。
技术挑战与难点:
- 切分歧义:同一字符序列可能有多种合理的切分方式
- 未登录词识别:新词、专有名词、外来词等未在词典中出现的词汇
- 上下文依赖:词汇边界往往依赖于上下文语义信息
主流技术方案:
- 基于词典的最大匹配算法:正向/反向/双向最大匹配
- 基于统计的序列标注:HMM、CRF等模型将分词转化为字符级标注问题
- 基于神经网络的端到端学习:BiLSTM-CRF、Transformer等深度模型
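在进入神经网络模型之前,先给出基于词典的正向最大匹配(FMM)的一个最小示意实现,用来说明上面提到的词典匹配思路(其中的示例词典为假设数据):

```python
# 正向最大匹配(FMM)最小示意:从左向右扫描,每次取不超过 max_len 的最长词典词
def fmm_segment(text, word_dict, max_len=4):
    words = []
    i = 0
    while i < len(text):
        matched = None
        # 从最大长度开始尝试匹配,匹配不到则退化为单字
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in word_dict or size == 1:
                matched = candidate
                break
        words.append(matched)
        i += len(matched)
    return words

word_dict = {"雍和宫", "荷花", "很好"}  # 假设的示例词典
print(fmm_segment("雍和宫的荷花开的很好", word_dict))
# ['雍和宫', '的', '荷花', '开', '的', '很好']
```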
```python
# 基于BiLSTM-CRF的中文分词系统伪代码
class ChineseWordSegmenter:
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_tags):
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.bilstm = nn.LSTM(embedding_dim, hidden_dim,
bidirectional=True, batch_first=True)
self.hidden2tag = nn.Linear(hidden_dim * 2, num_tags)
self.crf = CRF(num_tags)
# 标注体系: B-词首 I-词中 E-词尾 S-单字词
self.tag2idx = {'B': 0, 'I': 1, 'E': 2, 'S': 3}
self.idx2tag = {v: k for k, v in self.tag2idx.items()}
def forward(self, sentences, tags=None):
embeddings = self.embedding(sentences)
lstm_out, _ = self.bilstm(embeddings)
emissions = self.hidden2tag(lstm_out)
if tags is not None:
# 训练阶段
loss = -self.crf.log_likelihood(emissions, tags)
return loss
else:
# 预测阶段
best_paths = self.crf.viterbi_decode(emissions)
return best_paths
def segment(self, text):
char_ids = [self.char2idx.get(char, self.char2idx['<UNK>'])
for char in text]
char_tensor = torch.LongTensor([char_ids])
tag_ids = self.forward(char_tensor)[0]
tag_sequence = [self.idx2tag[i] for i in tag_ids]  # 将标签索引还原为BIES标签
# 根据标注序列重建分词结果
words = []
current_word = ""
for char, tag in zip(text, tag_sequence):
if tag == 'B' or tag == 'S':
if current_word:
words.append(current_word)
current_word = char
else: # 'I' or 'E'
current_word += char
if tag == 'E' or tag == 'S':
words.append(current_word)
current_word = ""
return words
# 实际使用示例
segmenter = ChineseWordSegmenter(vocab_size=5000, embedding_dim=128,
hidden_dim=256, num_tags=4)
text = "雍和宫的荷花开的很好"
result = segmenter.segment(text)
print(f"输入: {text}")
print(f"分词结果: {result}")
# 输出: ['雍和宫', '的', '荷花', '开', '的', '很', '好']
```

### 1.3.2 子词切分
子词切分(Subword Segmentation)技术旨在将词汇进一步分解为更细粒度的语义单元,有效解决词汇稀疏性和未登录词问题。这项技术在现代预训练语言模型中发挥着关键作用,使模型能够处理开放词汇表场景。
核心优势:
- 缓解数据稀疏性:通过子词单元减少词汇表规模
- 处理未登录词:将未见词汇分解为已知子词组合
- 跨语言一致性:为不同语言提供统一的文本表示方案
主流算法对比:
| 算法 | 基本思想 | 优势 | 局限性 |
| --- | --- | --- | --- |
| BPE | 贪心合并高频字符对 | 简单高效,可控词汇表大小 | 可能破坏语义完整性 |
| WordPiece | 基于似然最大化的合并策略 | 保持语义一致性 | 计算复杂度较高 |
| Unigram | 期望最大化算法 | 提供多种切分方案 | 实现复杂 |
| SentencePiece | 端到端无需预分词 | 语言无关性 | 配置参数较多 |
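作为直观对比,可以直接调用现成的 WordPiece 分词器观察未登录词如何被切分为子词。以下示例假设已安装 transformers,具体切分结果取决于所用词表,注释中的输出仅为示意:

```python
# 用现成的 WordPiece 分词器观察子词切分(需安装 transformers)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
for word in ["playing", "unaffordable", "tokenization"]:
    print(word, "->", tokenizer.tokenize(word))
# 未登录词会被拆成带 "##" 前缀的子词片段,例如 "unaffordable" 可能被切分为
# ['una', '##ffo', '##rda', '##ble'] 之类的组合(具体取决于词表)
```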
```python
# BPE算法实现伪代码
class BPETokenizer:
def __init__(self, vocab_size=30000):
self.vocab_size = vocab_size
self.word_freq = {}
self.vocab = set()
self.merges = []
def train(self, corpus):
# 1. 统计词频,初始化为字符级别
for sentence in corpus:
words = sentence.split()
for word in words:
word_chars = ' '.join(list(word)) + ' </w>'
self.word_freq[word_chars] = self.word_freq.get(word_chars, 0) + 1
# 2. 建立初始词汇表(所有字符)
for word in self.word_freq:
for char in word.split():
self.vocab.add(char)
# 3. 迭代合并最频繁的字符对
while len(self.vocab) < self.vocab_size:
pairs = self.get_pairs()
if not pairs:
break
best_pair = max(pairs, key=pairs.get)
self.vocab.add(''.join(best_pair))
self.merges.append(best_pair)
# 更新词频统计
self.merge_vocab(best_pair)
def get_pairs(self):
pairs = defaultdict(int)
for word, freq in self.word_freq.items():
symbols = word.split()
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i + 1])] += freq
return pairs
def merge_vocab(self, pair):
new_word_freq = {}
bigram = ' '.join(pair)
replacement = ''.join(pair)
for word in self.word_freq:
new_word = word.replace(bigram, replacement)
new_word_freq[new_word] = self.word_freq[word]
self.word_freq = new_word_freq
def encode(self, text):
tokens = []
for word in text.split():
word_tokens = self.bpe_encode(word)
tokens.extend(word_tokens)
return tokens
def bpe_encode(self, word):
word = ' '.join(list(word)) + ' </w>'
# 应用学习到的合并规则
for pair in self.merges:
if ' '.join(pair) in word:
word = word.replace(' '.join(pair), ''.join(pair))
return word.split()
# 使用示例
tokenizer = BPETokenizer(vocab_size=1000)
corpus = ["unhappiness is common", "happiness brings joy", ...]
tokenizer.train(corpus)
# 编码新文本
tokens = tokenizer.encode("unhappiness")
print(f"BPE tokens: {tokens}")
# 可能输出: ['un', 'happi', 'ness', '</w>']
```
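以上是手写 BPE 的教学示意。实际工程中通常直接使用 Hugging Face 的 tokenizers 库训练 BPE 分词器,下面是一个最小示意(其中 corpus.txt 为假设的语料文件路径):

```python
# 用 tokenizers 库训练 BPE 分词器的最小示意
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt 为假设的训练语料

output = tokenizer.encode("unhappiness brings nothing")
print(output.tokens)
```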
### 1.3.3 词性标注
词性标注(Part-of-Speech Tagging, POS Tagging)是为文本中每个词汇分配语法范畴标签的基础任务。准确的词性信息对句法分析、语义理解、信息抽取等下游任务具有重要价值。
技术发展脉络:
- 基于规则的方法:利用人工编写的语法规则进行标注
- 基于统计的方法:HMM、最大熵模型等概率图模型
- 基于神经网络的方法:BiLSTM、Transformer等深度架构
评估指标:
- 准确率(Accuracy):正确标注的词汇比例
- 未知词准确率:模型对训练集中未出现词汇的标注性能
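这两个指标的计算都非常直接,下面是一个最小示意(其中的 words、gold、pred 和 train_vocab 均为假设数据):

```python
# 词性标注评估指标的最小计算示意
def pos_accuracy(gold_tags, pred_tags, words, train_vocab):
    total = correct = unk_total = unk_correct = 0
    for word, g, p in zip(words, gold_tags, pred_tags):
        total += 1
        correct += int(g == p)
        if word not in train_vocab:          # 训练集中未出现的词
            unk_total += 1
            unk_correct += int(g == p)
    acc = correct / total
    unk_acc = unk_correct / unk_total if unk_total else float("nan")
    return acc, unk_acc

words = ["She", "is", "playing", "the", "guqin"]
gold  = ["PRP", "VBZ", "VBG", "DT", "NN"]
pred  = ["PRP", "VBZ", "VBG", "DT", "JJ"]
print(pos_accuracy(gold, pred, words, train_vocab={"She", "is", "playing", "the"}))
# (0.8, 0.0):总体准确率 0.8,未知词("guqin")准确率 0.0
```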
```python
# 基于BiLSTM的词性标注器伪代码
class POSTagger:
def __init__(self, vocab_size, tag_size, embedding_dim, hidden_dim):
self.word_embedding = nn.Embedding(vocab_size, embedding_dim)
self.char_embedding = nn.Embedding(256, 50) # 字符级特征
# 字符级BiLSTM
self.char_lstm = nn.LSTM(50, 25, bidirectional=True, batch_first=True)
# 词级BiLSTM
self.word_lstm = nn.LSTM(embedding_dim + 50, hidden_dim,
bidirectional=True, batch_first=True)
self.hidden2tag = nn.Linear(hidden_dim * 2, tag_size)
self.dropout = nn.Dropout(0.5)
def get_char_features(self, words):
char_features = []
for word in words:
chars = [ord(c) for c in word[:20]] # 截断长词
char_embeds = self.char_embedding(torch.LongTensor(chars))
char_lstm_out, (h, c) = self.char_lstm(char_embeds.unsqueeze(0))
# 使用最后时刻的隐状态作为字符特征
word_char_feature = torch.cat([h[0], h[1]], dim=1)
char_features.append(word_char_feature)
return torch.cat(char_features, dim=0)
def forward(self, sentence, sentence_chars):
# 词级嵌入
word_embeds = self.word_embedding(sentence)
# 字符级特征
char_features = self.get_char_features(sentence_chars)
# 拼接词嵌入和字符特征
combined_embeds = torch.cat([word_embeds, char_features.unsqueeze(0)], dim=2)
# BiLSTM编码
lstm_out, _ = self.word_lstm(combined_embeds)
lstm_out = self.dropout(lstm_out)
# 标签预测
tag_logits = self.hidden2tag(lstm_out)
return tag_logits
def predict(self, sentence, sentence_chars):
tag_logits = self.forward(sentence, sentence_chars)
predicted_tags = torch.argmax(tag_logits, dim=2)
return predicted_tags
# 标注示例
sentence = "She is playing the guitar"
pos_tags = ["PRP", "VBZ", "VBG", "DT", "NN"]
print(f"句子: {sentence}")
print(f"词性标注: {list(zip(sentence.split(), pos_tags))}")
# 输出: [('She', 'PRP'), ('is', 'VBZ'), ('playing', 'VBG'),
# ('the', 'DT'), ('guitar', 'NN')]
```

### 1.3.4 文本分类
文本分类(Text Classification)是NLP领域的经典任务,旨在根据文本内容将其归类到预定义的类别中。这项技术在垃圾邮件过滤、情感分析、新闻分类、内容审核等场景中有着广泛应用。
任务类型:
- 二分类:如垃圾邮件检测、情感极性分析
- 多分类:如新闻主题分类、产品类别识别
- 多标签分类:如文档标签标注、电影类型分类
- 层次分类:如学科分类、商品类目分类
技术演进:
- 传统方法:朴素贝叶斯、SVM + TF-IDF特征
- 深度学习方法:CNN、RNN、BERT等预训练模型
- 少样本学习:原型网络、元学习、提示学习
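作为传统方法的参考,下面用 scikit-learn 给出 TF-IDF + 线性 SVM 的一个最小示意(训练数据为几条假设样本,中文场景这里用字符 n-gram 代替分词特征):

```python
# 传统文本分类示意:TF-IDF + 线性SVM(scikit-learn)
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["新款手机芯片性能大幅提升", "央行宣布调整利率政策", "球队在总决赛中获胜"]
train_labels = ["科技", "财经", "体育"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(1, 2))),  # 中文用字符n-gram特征
    ("svm", LinearSVC()),
])
clf.fit(train_texts, train_labels)
print(clf.predict(["芯片行业迎来新一轮增长"]))  # 期望输出: ['科技']
```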
基于BERT的文本分类器伪代码
```python
# 基于BERT的文本分类器伪代码
class BERTClassifier:
def __init__(self, bert_model_name, num_classes, max_length=512):
self.bert = BertModel.from_pretrained(bert_model_name)
self.dropout = nn.Dropout(0.1)
self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)
self.max_length = max_length
def forward(self, input_ids, attention_mask, token_type_ids=None):
# BERT编码
outputs = self.bert(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)
# 使用[CLS]标记的表示进行分类
pooled_output = outputs.pooler_output
pooled_output = self.dropout(pooled_output)
logits = self.classifier(pooled_output)
return logits
def predict(self, texts, tokenizer):
predictions = []
for text in texts:
# 文本编码
encoded = tokenizer.encode_plus(
text,
add_special_tokens=True,
max_length=self.max_length,
padding='max_length',
truncation=True,
return_tensors='pt'
)
# 模型推理
with torch.no_grad():
logits = self.forward(
input_ids=encoded['input_ids'],
attention_mask=encoded['attention_mask']
)
predicted_class = torch.argmax(logits, dim=1).item()
confidence = torch.softmax(logits, dim=1).max().item()
predictions.append({
'class': predicted_class,
'confidence': confidence
})
return predictions
```

多标签文本分类示例
```python
# 多标签文本分类示例
class MultiLabelTextClassifier:
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels):
self.embedding = nn.Embedding(vocab_size, embedding_dim)
self.bilstm = nn.LSTM(embedding_dim, hidden_dim,
bidirectional=True, batch_first=True)
self.attention = nn.MultiheadAttention(hidden_dim * 2, num_heads=8)
self.classifier = nn.Linear(hidden_dim * 2, num_labels)
def forward(self, input_ids):
# 嵌入层
embeddings = self.embedding(input_ids)
# BiLSTM编码
lstm_out, _ = self.bilstm(embeddings)
# 注意力机制
attn_out, _ = self.attention(lstm_out, lstm_out, lstm_out)
# 全局平均池化
pooled = torch.mean(attn_out, dim=1)
# 多标签预测
logits = self.classifier(pooled)
return torch.sigmoid(logits) # 多标签使用sigmoid激活
```

使用示例
```python
# 使用示例
texts = [
"苹果公司发布了新款iPhone,搭载强大的A17芯片",
"美国总统宣布新的经济政策,股市应声上涨",
"NBA总决赛即将开打,湖人队备战充分"
]
class_labels = ["科技", "政治", "体育"]
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
classifier = BERTClassifier('bert-base-chinese', num_classes=3)
predictions = classifier.predict(texts, tokenizer)
for text, pred in zip(texts, predictions):
    predicted_label = class_labels[pred['class']]
    print(f"{text} -> {predicted_label} (置信度: {pred['confidence']:.3f})")
```

### 1.3.5 实体识别
命名实体识别(Named Entity Recognition, NER)是信息抽取领域的核心任务,旨在从非结构化文本中识别并分类具有特定语义的实体,如人名、地名、机构名、时间、金额等。NER是构建知识图谱、问答系统、信息检索系统的重要基础。
实体类型体系:
- 通用实体:人名(PER)、地名(LOC)、机构名(ORG)、其他(MISC)
- 领域实体:蛋白质、基因、疾病(生物医学);产品、品牌(电商)
- 细粒度实体:职业、国籍、货币、百分比等数十种类型
技术方案演进:
- 基于规则的方法:正则表达式、词典匹配、规则模板
- 基于机器学习:CRF、SVM等特征工程方法
- 基于深度学习:BiLSTM-CRF、BERT-CRF等端到端模型
- 少样本/零样本NER:原型学习、提示学习、生成式方法
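其中基于规则与词典的方法实现门槛最低,下面给出一个仅作演示思路的最小示意(正则与词典均为假设示例):

```python
# 基于规则与词典的简易NER示意
import re

ORG_DICT = {"微软公司", "谷歌"}
DATE_PATTERN = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")

def rule_based_ner(text):
    entities = []
    for m in DATE_PATTERN.finditer(text):                 # 规则:日期
        entities.append((m.group(), "DATE", m.start(), m.end()))
    for org in ORG_DICT:                                  # 词典:机构名
        idx = text.find(org)
        if idx != -1:
            entities.append((org, "ORG", idx, idx + len(org)))
    return entities

print(rule_based_ner("微软公司于2024年4月7日发布了新产品"))
# [('2024年4月7日', 'DATE', 5, 14), ('微软公司', 'ORG', 0, 4)]
```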
基于BERT-CRF的命名实体识别系统伪代码
```python
# 基于BERT-CRF的命名实体识别系统伪代码
class BERTCRFForNER:
def __init__(self, bert_model_name, num_labels):
self.bert = BertModel.from_pretrained(bert_model_name)
self.dropout = nn.Dropout(0.1)
self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
self.crf = CRF(num_labels, batch_first=True)
# BIO标注体系
self.label2id = {
'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-LOC': 3, 'I-LOC': 4,
'B-ORG': 5, 'I-ORG': 6, 'B-MISC': 7, 'I-MISC': 8
}
self.id2label = {v: k for k, v in self.label2id.items()}
def forward(self, input_ids, attention_mask, labels=None):
# BERT编码
bert_output = self.bert(input_ids=input_ids,
attention_mask=attention_mask)
sequence_output = bert_output.last_hidden_state
sequence_output = self.dropout(sequence_output)
# 线性分类层
emissions = self.classifier(sequence_output)
if labels is not None:
# 训练阶段:计算CRF损失
loss = -self.crf(emissions, labels, mask=attention_mask.byte())
return loss
else:
# 预测阶段:CRF解码
predictions = self.crf.decode(emissions, mask=attention_mask.byte())
return predictions
def extract_entities(self, text, tokenizer):
# 文本编码
encoded = tokenizer.encode_plus(
text, return_tensors='pt',
padding=True, truncation=True, max_length=512
)
# 模型预测
with torch.no_grad():
predictions = self.forward(
input_ids=encoded['input_ids'],
attention_mask=encoded['attention_mask']
)[0]
# 解码为实体
tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
entities = self.decode_entities(tokens, predictions)
return entities
def decode_entities(self, tokens, labels):
entities = []
current_entity = {"text": "", "label": "", "start": -1, "end": -1}
for i, (token, label) in enumerate(zip(tokens, labels)):
label_name = self.id2label[label]
if label_name.startswith('B-'):
# 开始新实体
if current_entity["text"]:
entities.append(current_entity.copy())
current_entity = {
"text": token,
"label": label_name[2:], # 去掉B-前缀
"start": i,
"end": i
}
elif label_name.startswith('I-') and current_entity["label"] == label_name[2:]:
# 继续当前实体
current_entity["text"] += token
current_entity["end"] = i
else:
# 实体结束
if current_entity["text"]:
entities.append(current_entity.copy())
current_entity = {"text": "", "label": "", "start": -1, "end": -1}
# 处理最后一个实体
if current_entity["text"]:
entities.append(current_entity)
return entities
```

嵌套命名实体识别
```python
# 嵌套命名实体识别
class NestedNERModel:
def __init__(self, bert_model_name, entity_types):
self.bert = BertModel.from_pretrained(bert_model_name)
self.entity_types = entity_types
# 为每种实体类型训练一个二分类器
self.classifiers = nn.ModuleDict({
entity_type: nn.Linear(self.bert.config.hidden_size, 2)
for entity_type in entity_types
})
def forward(self, input_ids, attention_mask):
bert_output = self.bert(input_ids, attention_mask)
sequence_output = bert_output.last_hidden_state
# 为每种实体类型预测
predictions = {}
for entity_type in self.entity_types:
logits = self.classifiers[entity_type](sequence_output)
predictions[entity_type] = torch.softmax(logits, dim=-1)
return predictions
```

使用示例
```python
# 使用示例
text = "李雷和韩梅梅是北京市海淀区的居民,他们计划在2024年4月7日去上海旅行"
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
ner_model = BERTCRFForNER('bert-base-chinese', num_labels=9)
entities = ner_model.extract_entities(text, tokenizer)
print(f"输入文本: {text}")
print("识别的实体:")
for entity in entities:
print(f" {entity['text']} -> {entity['label']}")
```

### 1.3.6 关系抽取
关系抽取(Relation Extraction, RE)是信息抽取的核心任务之一,旨在从文本中识别实体对之间的语义关系。这项技术是构建知识图谱、智能问答、信息检索系统的关键组件,能够将非结构化文本转换为结构化的三元组知识。
关系类型分类:
- 语义关系:上下位关系、同义关系、反义关系
- 事实关系:出生地、就职于、位于、发生于
- 因果关系:导致、引起、影响、促进
- 时序关系:之前、之后、同时、期间
技术挑战:
- 关系的多样性和复杂性:自然语言表达同一关系的方式多样
- 远程监督的噪声问题:自动标注数据存在标签噪声
- 少样本关系识别:新领域关系类型的快速适应
- 关系方向判断:正确识别关系的主语和宾语
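后文示例会用 [E1]/[/E1]、[E2]/[/E2] 标记实体位置。若使用 transformers,这类标记通常需要先注册为特殊 token 并扩展词表,下面是一个用法示意(并非某篇论文的标准做法,仅供参考):

```python
# 将实体标记注册为特殊token并扩展嵌入矩阵(transformers用法示意)
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

special_tokens = {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
tokenizer.add_special_tokens(special_tokens)
model.resize_token_embeddings(len(tokenizer))  # 扩展嵌入矩阵以容纳新增token

marked_text = "[E1]比尔·盖茨[/E1]是[E2]微软公司[/E2]的创始人"
print(tokenizer.tokenize(marked_text))
```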
基于BERT的关系分类器伪代码
```python
# 基于BERT的关系分类器伪代码
class BERTRelationClassifier:
def __init__(self, bert_model_name, num_relations):
self.bert = BertModel.from_pretrained(bert_model_name)
self.dropout = nn.Dropout(0.1)
self.relation_classifier = nn.Linear(self.bert.config.hidden_size, num_relations)
# 实体位置嵌入
self.entity_embedding = nn.Embedding(4, 50) # 实体位置标记嵌入:1表示头实体span,3表示尾实体span
def forward(self, input_ids, attention_mask, entity_positions):
# BERT编码
bert_output = self.bert(input_ids, attention_mask)
sequence_output = bert_output.last_hidden_state
# 实体位置信息融入
entity_embeds = self.entity_embedding(entity_positions)
enhanced_output = sequence_output + entity_embeds
# 提取实体对表示
entity_repr = self.extract_entity_representation(enhanced_output, entity_positions)
# 关系分类
relation_logits = self.relation_classifier(entity_repr)
return relation_logits
def extract_entity_representation(self, sequence_output, entity_positions):
batch_size, seq_len, hidden_size = sequence_output.size()
# 提取两个实体的平均表示
entity1_mask = (entity_positions == 1).float().unsqueeze(-1) # [E1]标记
entity2_mask = (entity_positions == 3).float().unsqueeze(-1) # [E2]标记
entity1_repr = torch.sum(sequence_output * entity1_mask, dim=1) / \
torch.sum(entity1_mask, dim=1)
entity2_repr = torch.sum(sequence_output * entity2_mask, dim=1) / \
torch.sum(entity2_mask, dim=1)
# 拼接实体表示
entity_pair_repr = torch.cat([entity1_repr, entity2_repr], dim=1)
return entity_pair_repr
```

基于图神经网络的关系抽取
```python
# 基于图神经网络的关系抽取
class GCNRelationExtractor:
def __init__(self, vocab_size, embedding_dim, hidden_dim, num_relations):
self.word_embedding = nn.Embedding(vocab_size, embedding_dim)
self.pos_embedding = nn.Embedding(50, 50) # 位置嵌入
# 图卷积网络层
self.gcn_layers = nn.ModuleList([
GCNLayer(embedding_dim + 50, hidden_dim),
GCNLayer(hidden_dim, hidden_dim)
])
self.classifier = nn.Linear(hidden_dim * 2, num_relations)
def forward(self, words, pos_tags, dependency_adj, head_idx, tail_idx):
# 词嵌入和位置嵌入
word_embeds = self.word_embedding(words)
pos_embeds = self.pos_embedding(pos_tags)
# 拼接特征
node_features = torch.cat([word_embeds, pos_embeds], dim=-1)
# 图卷积编码
for gcn_layer in self.gcn_layers:
node_features = gcn_layer(node_features, dependency_adj)
node_features = F.relu(node_features)
# 提取头尾实体表示
head_repr = node_features[head_idx]
tail_repr = node_features[tail_idx]
# 关系预测
relation_repr = torch.cat([head_repr, tail_repr], dim=-1)
relation_logits = self.classifier(relation_repr)
return relation_logits
```

联合实体关系抽取
```python
# 联合实体关系抽取
class JointEntityRelationExtractor:
def __init__(self, bert_model_name, entity_labels, relation_labels):
self.bert = BertModel.from_pretrained(bert_model_name)
# 实体识别头
self.entity_classifier = nn.Linear(
self.bert.config.hidden_size, len(entity_labels))
# 关系识别头(基于实体对)
self.relation_classifier = nn.Linear(
self.bert.config.hidden_size * 2, len(relation_labels))
def forward(self, input_ids, attention_mask):
bert_output = self.bert(input_ids, attention_mask)
sequence_output = bert_output.last_hidden_state
# 实体预测
entity_logits = self.entity_classifier(sequence_output)
# 生成所有可能的实体对
entity_pairs = self.generate_entity_pairs(sequence_output, entity_logits)
# 关系预测
relation_predictions = []
for head_repr, tail_repr in entity_pairs:
pair_repr = torch.cat([head_repr, tail_repr], dim=-1)
relation_logits = self.relation_classifier(pair_repr)
relation_predictions.append(relation_logits)
return entity_logits, relation_predictions
```

使用示例
```python
# 使用示例
text = "比尔·盖茨是微软公司的创始人"
# 标记实体位置
marked_text = "[E1]比尔·盖茨[/E1]是[E2]微软公司[/E2]的创始人"
relation_extractor = BERTRelationClassifier('bert-base-chinese', num_relations=50)
# 关系类型
relations = {
0: "创始人",
1: "CEO",
2: "员工",
3: "投资者",
# ... 更多关系类型
}
predicted_relation = relation_extractor.predict(marked_text)
```

### 1.3.7 文本摘要
文本摘要(Text Summarization)是自动生成简洁、准确、流畅摘要的NLP任务,旨在保留原文的关键信息和主要观点。随着信息爆炸时代的到来,文本摘要技术在新闻聚合、学术文献处理、报告生成等场景中发挥着越来越重要的作用。
技术分类:
- 抽取式摘要(Extractive Summarization)
  - 直接选择原文中的重要句子组成摘要
  - 优点:语法正确性好,事实准确性高
  - 缺点:连贯性可能较差,表达不够灵活
- 生成式摘要(Abstractive Summarization)
  - 理解原文语义后重新组织语言生成摘要
  - 优点:表达灵活,连贯性好
  - 缺点:可能产生事实错误,计算复杂度高
评估指标:
- ROUGE:基于n-gram重叠的自动评估指标
- BLEU:主要用于机器翻译,也可用于摘要评估
- BERTScore:基于语义相似度的评估方法
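以 ROUGE 为例,可以用 rouge-score 库直接计算,下面是一个最小示意(示例句子为英文;中文文本一般需要先分词再计算):

```python
# 用 rouge-score 库计算 ROUGE 指标的最小示意
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "ai technology drives digital transformation across industries"
prediction = "ai drives digital transformation in many industries"
scores = scorer.score(reference, prediction)
for name, score in scores.items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```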
基于Transformer的抽取式摘要系统伪代码
```python
# 基于Transformer的抽取式摘要系统伪代码
class ExtractiveSummarizer:
def __init__(self, bert_model_name, max_seq_length=512):
self.bert = BertModel.from_pretrained(bert_model_name)
self.sentence_classifier = nn.Linear(self.bert.config.hidden_size, 1)
self.max_seq_length = max_seq_length
def forward(self, input_ids, attention_mask, sentence_positions):
# BERT编码整篇文档
bert_output = self.bert(input_ids, attention_mask)
sequence_output = bert_output.last_hidden_state
# 提取每个句子的表示
sentence_representations = self.extract_sentence_representations(
sequence_output, sentence_positions)
# 预测每个句子的重要性分数
importance_scores = self.sentence_classifier(sentence_representations)
return torch.sigmoid(importance_scores) # 转换为0-1概率
def extract_sentence_representations(self, sequence_output, sentence_positions):
sentence_reprs = []
for start, end in sentence_positions:
# 使用句子中所有token的平均表示
sentence_repr = torch.mean(sequence_output[start:end], dim=0)
sentence_reprs.append(sentence_repr)
return torch.stack(sentence_reprs)
def summarize(self, document, top_k=3):
sentences = self.split_sentences(document)
# 编码文档
encoded = self.encode_document(document)
# 预测句子重要性
with torch.no_grad():
importance_scores = self.forward(
encoded['input_ids'],
encoded['attention_mask'],
encoded['sentence_positions']
)
# 选择top-k重要句子
top_indices = torch.topk(importance_scores.squeeze(), top_k).indices
summary_sentences = [sentences[i] for i in sorted(top_indices)]
return ' '.join(summary_sentences)
```

多文档摘要系统
```python
# 多文档摘要系统
class MultiDocumentSummarizer:
def __init__(self, bert_model_name):
self.sentence_encoder = SentenceBERT(bert_model_name)
self.cluster_algorithm = KMeans(n_clusters=5)
def summarize_multiple_documents(self, documents, summary_length=200):
# 1. 提取所有句子
all_sentences = []
doc_ids = []
for doc_id, doc in enumerate(documents):
sentences = self.split_sentences(doc)
all_sentences.extend(sentences)
doc_ids.extend([doc_id] * len(sentences))
# 2. 句子编码
sentence_embeddings = self.sentence_encoder.encode(all_sentences)
# 3. 句子聚类
clusters = self.cluster_algorithm.fit_predict(sentence_embeddings)
# 4. 从每个聚类中选择代表句子
summary_sentences = []
for cluster_id in set(clusters):
cluster_sentences = [sent for i, sent in enumerate(all_sentences)
if clusters[i] == cluster_id]
# 选择与聚类中心最近的句子作为代表
cluster_center = np.mean([sentence_embeddings[i] for i, c in enumerate(clusters)
if c == cluster_id], axis=0)
best_sentence = self.find_closest_sentence(cluster_sentences, cluster_center)
summary_sentences.append(best_sentence)
# 5. 按原文顺序重新排列并截断
summary = ' '.join(summary_sentences[:summary_length])
return summary
```

基于Seq2Seq的生成式摘要系统
```python
# 基于Seq2Seq的生成式摘要系统
class AbstractiveSummarizer:
def __init__(self, model_name='t5-base', max_input_length=1024, max_output_length=128):
self.tokenizer = T5Tokenizer.from_pretrained(model_name)
self.model = T5ForConditionalGeneration.from_pretrained(model_name)
self.max_input_length = max_input_length
self.max_output_length = max_output_length
def summarize(self, text, num_beams=4, length_penalty=2.0):
# 添加任务前缀
input_text = f"summarize: {text}"
# 编码输入
input_ids = self.tokenizer.encode(
input_text,
return_tensors='pt',
max_length=self.max_input_length,
truncation=True,
padding=True
)
# 生成摘要
with torch.no_grad():
summary_ids = self.model.generate(
input_ids,
num_beams=num_beams,
max_length=self.max_output_length,
length_penalty=length_penalty,
early_stopping=True
)
# 解码摘要
summary = self.tokenizer.decode(summary_ids[0], skip_special_tokens=True)
return summary
```

层次化摘要系统
```python
# 层次化摘要系统
class HierarchicalSummarizer:
def __init__(self):
self.sentence_summarizer = ExtractiveSummarizer()
self.paragraph_summarizer = AbstractiveSummarizer()
def hierarchical_summarize(self, long_document):
# 第一层:段落级摘要
paragraphs = self.split_paragraphs(long_document)
paragraph_summaries = []
for paragraph in paragraphs:
if len(paragraph) > 200: # 长段落需要摘要
para_summary = self.sentence_summarizer.summarize(paragraph, top_k=2)
paragraph_summaries.append(para_summary)
else:
paragraph_summaries.append(paragraph)
# 第二层:文档级摘要
combined_text = ' '.join(paragraph_summaries)
final_summary = self.paragraph_summarizer.summarize(combined_text)
return final_summary
```

使用示例
```python
# 使用示例
document = """
2024年第三季度,全球科技行业继续保持强劲增长态势。人工智能技术的快速发展
推动了各个行业的数字化转型进程。大型科技公司纷纷加大在AI领域的投资力度,
推出了众多创新产品和服务。
其中,自然语言处理技术取得了重大突破。新一代大语言模型在理解和生成能力上
达到了前所未有的水平,为智能对话、内容创作、代码生成等应用场景提供了强大
的技术支撑。同时,模型的效率也在不断提升,使得大规模应用成为可能。
展望未来,人工智能技术将继续深度融入各行各业,为社会发展注入新的活力。
预计到2025年,AI相关产业的市场规模将突破万亿美元,成为全球经济增长的
重要引擎。
"""
# 抽取式摘要
extractive_summarizer = ExtractiveSummarizer()
extractive_summary = extractive_summarizer.summarize(document, top_k=2)
print(f"抽取式摘要: {extractive_summary}")
# 生成式摘要
abstractive_summarizer = AbstractiveSummarizer()
abstractive_summary = abstractive_summarizer.summarize(document)
print(f"生成式摘要: {abstractive_summary}")
```

### 1.3.8 机器翻译
机器翻译(Machine Translation, MT)是NLP领域最具挑战性和应用价值的任务之一,旨在实现不同自然语言之间的自动转换。现代机器翻译系统不仅要处理词汇的对应关系,更要准确传达源语言的语义、语用信息和文化内涵,实现真正的跨语言交流。
技术发展脉络:
- 基于规则的机器翻译(RBMT):依赖专家编写的语法规则和词典
- 统计机器翻译(SMT):基于平行语料的统计学习方法
- 神经机器翻译(NMT):端到端的神经网络架构
- 大模型时代的翻译:多语言预训练模型的zero-shot翻译
核心技术挑战:
- 语序差异:不同语言的句法结构差异巨大
- 一词多义:相同词汇在不同语境下的翻译选择
- 文化适应:习语、俚语、文化特定概念的翻译
- 长距离依赖:长句中跨越多个短语的语义依赖
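在动手实现之前,也可以直接调用开源的预训练翻译模型。下面是基于 transformers pipeline 的一个最小示意(模型名 Helsinki-NLP/opus-mt-zh-en 为常见的开源中英翻译模型,仅作示例):

```python
# 直接调用预训练翻译模型的最小示意(需安装 transformers)
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")
result = translator("今天天气很好", max_length=40)
print(result[0]["translation_text"])  # 预期得到类似 "The weather is very nice today" 的译文
```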
基于Transformer的神经机器翻译系统伪代码
```python
# 基于Transformer的神经机器翻译系统伪代码
class TransformerMT:
def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512,
num_heads=8, num_layers=6, d_ff=2048):
self.d_model = d_model
# 编码器和解码器嵌入层
self.src_embedding = nn.Embedding(src_vocab_size, d_model)
self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
# 位置编码
self.pos_encoding = PositionalEncoding(d_model)
# Transformer编码器
encoder_layer = nn.TransformerEncoderLayer(
d_model=d_model, nhead=num_heads, dim_feedforward=d_ff)
self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
# Transformer解码器
decoder_layer = nn.TransformerDecoderLayer(
d_model=d_model, nhead=num_heads, dim_feedforward=d_ff)
self.decoder = nn.TransformerDecoder(decoder_layer, num_layers)
# 输出投影层
self.output_projection = nn.Linear(d_model, tgt_vocab_size)
def forward(self, src_tokens, tgt_tokens=None):
# 源语言编码
src_embed = self.src_embedding(src_tokens) * math.sqrt(self.d_model)
src_embed = self.pos_encoding(src_embed)
# 编码器
memory = self.encoder(src_embed)
if tgt_tokens is not None:
# 训练模式:teacher forcing
tgt_embed = self.tgt_embedding(tgt_tokens) * math.sqrt(self.d_model)
tgt_embed = self.pos_encoding(tgt_embed)
# 创建目标序列的因果掩码
tgt_mask = self.generate_square_subsequent_mask(tgt_tokens.size(1))
# 解码器
decoder_output = self.decoder(tgt_embed, memory, tgt_mask=tgt_mask)
# 输出层
logits = self.output_projection(decoder_output)
return logits
else:
# 推理模式:自回归生成
return self.generate(memory)
def generate_square_subsequent_mask(self, size):
mask = torch.triu(torch.ones(size, size), diagonal=1)
return mask.bool()
def generate(self, memory, max_length=100, bos_token=1, eos_token=2):
batch_size = memory.size(0)
generated = torch.LongTensor([[bos_token]] * batch_size)
for _ in range(max_length):
tgt_embed = self.tgt_embedding(generated) * math.sqrt(self.d_model)
tgt_embed = self.pos_encoding(tgt_embed)
tgt_mask = self.generate_square_subsequent_mask(generated.size(1))
decoder_output = self.decoder(tgt_embed, memory, tgt_mask=tgt_mask)
# 预测下一个词
next_token_logits = self.output_projection(decoder_output[:, -1, :])
next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
generated = torch.cat([generated, next_token], dim=1)
# 检查是否所有序列都结束
if torch.all(next_token.squeeze() == eos_token):
break
return generated
```

多语言机器翻译系统
```python
# 多语言机器翻译系统
class MultilingualMT:
def __init__(self, languages, shared_vocab_size=50000):
self.languages = languages
self.num_languages = len(languages)
# 共享的多语言编码器
self.shared_encoder = TransformerEncoder(
vocab_size=shared_vocab_size,
d_model=512,
num_heads=8,
num_layers=12
)
# 语言特定的解码器
self.decoders = nn.ModuleDict({
lang: TransformerDecoder(
vocab_size=shared_vocab_size,
d_model=512,
num_heads=8,
num_layers=6
) for lang in languages
})
# 语言标识嵌入
self.lang_embedding = nn.Embedding(self.num_languages, 512)
def forward(self, src_tokens, src_lang, tgt_tokens, tgt_lang):
# 添加源语言标识
src_lang_id = self.get_lang_id(src_lang)
src_lang_embed = self.lang_embedding(src_lang_id)
# 共享编码器编码
encoder_output = self.shared_encoder(src_tokens, src_lang_embed)
# 目标语言特定解码器
tgt_decoder = self.decoders[tgt_lang]
# 添加目标语言标识
tgt_lang_id = self.get_lang_id(tgt_lang)
tgt_lang_embed = self.lang_embedding(tgt_lang_id)
# 解码
output = tgt_decoder(tgt_tokens, encoder_output, tgt_lang_embed)
return output
```

基于注意力的对齐可视化
```python
# 基于注意力的对齐可视化
class AttentionVisualizer:
def __init__(self, model):
self.model = model
self.attention_weights = {}
# 注册hook收集注意力权重
self.register_hooks()
def register_hooks(self):
def attention_hook(module, input, output):
if hasattr(module, 'attention_weights'):
layer_name = module.__class__.__name__
self.attention_weights[layer_name] = output[1] # attention weights
for layer in self.model.modules():
if isinstance(layer, nn.MultiheadAttention):
layer.register_forward_hook(attention_hook)
def visualize_alignment(self, src_sentence, tgt_sentence, src_tokens, tgt_tokens):
# 获取翻译过程中的注意力权重
with torch.no_grad():
_ = self.model(src_tokens, tgt_tokens)
# 提取编码器-解码器注意力权重
enc_dec_attention = self.attention_weights.get('MultiheadAttention', None)
if enc_dec_attention is not None:
# 绘制注意力热力图
self.plot_attention_heatmap(
src_sentence, tgt_sentence, enc_dec_attention)
def plot_attention_heatmap(self, src_words, tgt_words, attention_matrix):
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 8))
sns.heatmap(attention_matrix,
xticklabels=src_words,
yticklabels=tgt_words,
cmap='Blues')
plt.title('Translation Attention Alignment')
plt.xlabel('Source Sentence')
plt.ylabel('Target Sentence')
plt.show()
```

翻译质量评估
```python
# 翻译质量评估
class TranslationEvaluator:
def __init__(self):
self.bleu = BLEUScore()
self.meteor = METEORScore()
self.bertscore = BERTScore()
def evaluate_translation(self, predictions, references):
results = {}
# BLEU评分
results['bleu'] = self.bleu.compute(predictions, references)
# METEOR评分
results['meteor'] = self.meteor.compute(predictions, references)
# BERTScore评分
results['bertscore'] = self.bertscore.compute(predictions, references)
# 人工评估指标
results['adequacy'] = self.human_adequacy_score(predictions, references)
results['fluency'] = self.human_fluency_score(predictions)
return results
```
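上面的 BLEUScore、METEORScore 等封装类仅为示意。实际计算语料级 BLEU 可以直接使用 sacrebleu,例如:

```python
# 用 sacrebleu 计算语料级 BLEU 的最小示意
import sacrebleu

predictions = ["The weather is very nice today",
               "I like studying natural language processing"]
references = [["The weather is nice today",
               "I like to study natural language processing"]]  # 每个内层列表是一组完整参考译文
bleu = sacrebleu.corpus_bleu(predictions, references)
print(f"BLEU = {bleu.score:.2f}")
```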
使用示例
```python
# 使用示例
# 中英翻译系统
zh_en_translator = TransformerMT(
src_vocab_size=50000, # 中文词汇表
tgt_vocab_size=30000, # 英文词汇表
d_model=512,
num_heads=8,
num_layers=6
)
# 翻译示例
chinese_text = "今天天气很好"
english_translation = zh_en_translator.translate(chinese_text)
print(f"中文原文: {chinese_text}")
print(f"英文翻译: {english_translation}")
# 预期输出: "The weather is very nice today"
# 多语言翻译示例
multilingual_translator = MultilingualMT(['zh', 'en', 'fr', 'de', 'ja'])
# 中文到法语的翻译
french_translation = multilingual_translator.translate(
text="我喜欢学习自然语言处理",
src_lang='zh',
tgt_lang='fr'
)
print(f"中文: 我喜欢学习自然语言处理")
print(f"法语: {french_translation}")
# 预期输出: "J'aime étudier le traitement du langage naturel"
```

### 1.3.9 自动问答
自动问答(Automatic Question Answering, QA)是NLP领域的高级任务,旨在构建能够理解自然语言问题并提供准确答案的智能系统。这项技术融合了信息检索、阅读理解、知识推理等多项NLP能力,是实现人机智能交互的关键技术。
问答系统分类:
- 事实型问答:回答具体事实性问题(如"北京的人口是多少?")
- 定义型问答:解释概念或术语(如"什么是深度学习?")
- 推理型问答:需要逻辑推理的复杂问题
- 对话型问答:支持多轮对话的交互式问答
技术架构类型:
- 检索式问答(Retrieval-based QA):从文档集合中检索相关段落并提取答案
- 生成式问答(Generative QA):直接生成自然语言答案
- 混合式问答(Hybrid QA):结合检索和生成的优势
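如果只需要抽取式问答的基本能力,也可以直接调用 transformers 的 question-answering pipeline,下面是一个最小示意(模型名仅作示例,可替换为任意抽取式问答模型):

```python
# 抽取式问答的最小调用示意(需安装 transformers)
from transformers import pipeline

qa = pipeline("question-answering", model="uer/roberta-base-chinese-extractive-qa")
result = qa(question="BERT是哪家公司提出的?",
            context="BERT是Google在2018年提出的预训练语言模型。")
print(result)  # 形如 {'score': ..., 'start': ..., 'end': ..., 'answer': 'Google'}
```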
基于检索增强生成的问答系统伪代码
```python
# 基于检索增强生成的问答系统伪代码
class RetrievalAugmentedQA:
def __init__(self, document_encoder, question_encoder, generator):
self.document_encoder = document_encoder # 文档编码器
self.question_encoder = question_encoder # 问题编码器
self.generator = generator # 答案生成器
self.document_index = None # 文档索引
def build_index(self, documents):
"""构建文档索引"""
document_embeddings = []
for doc in documents:
# 将文档分割成段落
passages = self.split_into_passages(doc)
for passage in passages:
# 编码段落
passage_embedding = self.document_encoder.encode(passage)
document_embeddings.append({
'text': passage,
'embedding': passage_embedding,
'doc_id': doc['id']
})
# 构建向量检索索引
self.document_index = FAISSIndex(document_embeddings)
def retrieve_relevant_passages(self, question, top_k=5):
"""检索相关段落"""
question_embedding = self.question_encoder.encode(question)
# 在索引中搜索最相似的段落
similar_passages = self.document_index.search(
question_embedding, top_k=top_k)
return similar_passages
def answer_question(self, question, top_k=5):
"""回答问题"""
# 1. 检索相关段落
relevant_passages = self.retrieve_relevant_passages(question, top_k)
# 2. 构建生成的输入
context = self.combine_passages(relevant_passages)
generation_input = f"Question: {question}\nContext: {context}\nAnswer:"
# 3. 生成答案
answer = self.generator.generate(generation_input)
# 4. 后处理和验证
verified_answer = self.verify_answer(answer, context)
return {
'answer': verified_answer,
'confidence': self.calculate_confidence(answer, context),
'sources': [p['doc_id'] for p in relevant_passages]
}
```

基于BERT的阅读理解问答系统
```python
# 基于BERT的阅读理解问答系统
class BERTReadingComprehension:
def __init__(self, bert_model_name):
self.bert = BertForQuestionAnswering.from_pretrained(bert_model_name)
self.tokenizer = BertTokenizer.from_pretrained(bert_model_name)
self.max_length = 512
def answer_question(self, question, context):
# 编码问题和上下文
encoded = self.tokenizer.encode_plus(
question, context,
add_special_tokens=True,
max_length=self.max_length,
padding='max_length',
truncation='only_second', # 只截断context
return_tensors='pt'
)
input_ids = encoded['input_ids']
attention_mask = encoded['attention_mask']
token_type_ids = encoded['token_type_ids']
# 模型预测
with torch.no_grad():
outputs = self.bert(input_ids=input_ids,
attention_mask=attention_mask,
token_type_ids=token_type_ids)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
# 找到最佳答案span
start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits)
# 确保end >= start
if end_idx < start_idx:
end_idx = start_idx
# 解码答案
answer_tokens = input_ids[0][start_idx:end_idx+1]
answer = self.tokenizer.decode(answer_tokens, skip_special_tokens=True)
# 计算置信度
start_confidence = torch.softmax(start_logits, dim=1).max().item()
end_confidence = torch.softmax(end_logits, dim=1).max().item()
confidence = (start_confidence + end_confidence) / 2
return {
'answer': answer,
'confidence': confidence,
'start_position': start_idx.item(),
'end_position': end_idx.item()
}
```

多跳推理问答系统
```python
# 多跳推理问答系统
class MultiHopQA:
def __init__(self, knowledge_graph, reasoning_model):
self.knowledge_graph = knowledge_graph
self.reasoning_model = reasoning_model
self.entity_linker = EntityLinker()
def answer_complex_question(self, question):
# 1. 问题分解
sub_questions = self.decompose_question(question)
# 2. 实体链接
entities = self.entity_linker.link_entities(question)
# 3. 多跳推理
reasoning_path = []
current_entities = entities
for sub_question in sub_questions:
# 在知识图谱中寻找相关路径
related_facts = self.knowledge_graph.find_related_facts(
current_entities, sub_question)
reasoning_path.append({
'sub_question': sub_question,
'facts': related_facts,
'entities': current_entities
})
# 更新当前实体集合
current_entities = self.extract_new_entities(related_facts)
# 4. 综合推理得出最终答案
final_answer = self.reasoning_model.synthesize_answer(
question, reasoning_path)
return {
'answer': final_answer,
'reasoning_path': reasoning_path,
'confidence': self.calculate_reasoning_confidence(reasoning_path)
}
```

对话式问答系统
```python
# 对话式问答系统
class ConversationalQA:
def __init__(self, qa_model, context_manager):
self.qa_model = qa_model
self.context_manager = context_manager
self.conversation_history = []
def answer_in_context(self, question, session_id):
# 1. 获取对话上下文
conversation_context = self.context_manager.get_context(session_id)
# 2. 问题理解和指代消解
resolved_question = self.resolve_references(
question, conversation_context)
# 3. 意图识别
intent = self.classify_intent(resolved_question)
if intent == 'follow_up':
# 基于上下文的追问
enhanced_question = self.enhance_with_context(
resolved_question, conversation_context)
elif intent == 'clarification':
# 澄清型问题
return self.handle_clarification(resolved_question, conversation_context)
else:
# 新的独立问题
enhanced_question = resolved_question
# 4. 回答问题
answer = self.qa_model.answer_question(enhanced_question)
# 5. 更新对话上下文
self.context_manager.update_context(
session_id, question, answer, resolved_question)
return answer
```

问答系统评估框架
```python
# 问答系统评估框架
class QAEvaluator:
def __init__(self):
self.exact_match = ExactMatch()
self.f1_score = F1Score()
self.bleu_score = BLEUScore()
self.human_evaluator = HumanEvaluator()
def comprehensive_evaluation(self, qa_system, test_dataset):
results = {
'exact_match': 0,
'f1_score': 0,
'bleu_score': 0,
'answer_coverage': 0,
'response_time': [],
'human_ratings': {}
}
predictions = []
references = []
for sample in test_dataset:
question = sample['question']
ground_truth = sample['answer']
context = sample.get('context', '')
# 测量响应时间
start_time = time.time()
prediction = qa_system.answer_question(question, context)
response_time = time.time() - start_time
predictions.append(prediction['answer'])
references.append(ground_truth)
results['response_time'].append(response_time)
# 自动评估指标
results['exact_match'] = self.exact_match.compute(predictions, references)
results['f1_score'] = self.f1_score.compute(predictions, references)
results['bleu_score'] = self.bleu_score.compute(predictions, references)
# 人工评估
results['human_ratings'] = self.human_evaluator.evaluate(
predictions, references, test_dataset)
return results
# 使用示例
# 构建检索增强问答系统
documents = [
{"id": "doc1", "text": "BERT是Google在2018年发布的预训练语言模型..."},
{"id": "doc2", "text": "Transformer架构使用了自注意力机制..."},
# 更多文档...
]
rag_qa = RetrievalAugmentedQA(
document_encoder=SentenceBERT('sentence-transformers/all-MiniLM-L6-v2'),
question_encoder=SentenceBERT('sentence-transformers/all-MiniLM-L6-v2'),
generator=T5Generator('t5-base')
)
rag_qa.build_index(documents)
```

问答示例
```python
# 问答示例
question = "BERT模型是什么时候发布的?"
answer = rag_qa.answer_question(question)
print(f"问题: {question}")
print(f"答案: {answer['answer']}")
print(f"置信度: {answer['confidence']:.3f}")
print(f"来源: {answer['sources']}")
# 阅读理解问答
reading_qa = BERTReadingComprehension('bert-base-chinese')
context = "BERT(Bidirectional Encoder Representations from Transformers)是Google在2018年提出的预训练语言模型。"
question = "BERT是哪家公司提出的?"
result = reading_qa.answer_question(question, context)
print(f"问题: {question}")
print(f"答案: {result['answer']}")
print(f"置信度: {result['confidence']:.3f}")