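Datawhale LLM Algorithms Full-Stack Fundamentals, 202602, Note 4

Notes:

The code for the attention mechanism section started running at 10:00 on day 1 and finished at 14:33 on day 2. The results were still unsatisfactory, so it needs further tuning.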

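Section 1: Seq2Seq Architecture

I. What is Seq2Seq?

Suppose you want to translate a Chinese sentence into English. You cannot translate it word by word, because the two languages may differ in word order and length. For example:

Chinese: "我爱人工智能" (3 words)
English: "I love artificial intelligence" (4 words)

So what do we do? We can proceed in two steps:

First, read the whole Chinese sentence and understand what it means.
Then, based on that understanding, say the sentence in English.

This is the core idea of Seq2Seq (sequence to sequence). It consists of two parts:

Encoder: "reads" the input sequence (e.g. the Chinese sentence) and condenses the meaning of the whole sentence into a context vector (like a "semantic summary" kept in mind).
Decoder: takes this context vector and "speaks" the output sequence (e.g. the English sentence) step by step.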

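II. What do the encoder and decoder look like?

In the earliest Seq2Seq models, the encoder and decoder were usually implemented with an RNN or LSTM.

2.1 Encoder

It reads the words of the input sentence in order (e.g. "我", "爱", "人工智能").
Each time it reads a word, it updates its "memory" (the hidden state).
Once it has read the last word, its final memory serves as the context vector for the whole sentence.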
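The read-and-update loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not the LSTM used later in this note: `rnn_step`, its weights `w_h` and `w_x`, and the 3-dimensional "embeddings" are all made up for the example, and a real RNN uses weight matrices rather than an elementwise update. What it shows is that the final hidden state has the same size whether the sentence has 3 words or 30.

```python
import math

def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    """One recurrent update: mix the previous memory with the current input.
    (Elementwise toy update; w_h and w_x are hypothetical fixed weights.)"""
    return [math.tanh(w_h * h + w_x * xi) for h, xi in zip(hidden, x)]

def encode(embedded_words, hidden_size=3):
    """Read the words in order; the final hidden state is the context vector."""
    hidden = [0.0] * hidden_size
    for x in embedded_words:
        hidden = rnn_step(hidden, x)
    return hidden

short_sentence = [[0.1, 0.2, 0.3]] * 3    # a 3-word "sentence" of fake embeddings
long_sentence = [[0.1, 0.2, 0.3]] * 30    # a 30-word "sentence"

context_short = encode(short_sentence)
context_long = encode(long_sentence)

# The context vector has the same length regardless of the input length.
print(len(context_short), len(context_long))
```

Because the state size is fixed, every extra word has to share the same few numbers with everything read before it.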

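2.2 Decoder

It takes the encoder's final memory as its own initial memory.
It starts from a special <SOS> (start-of-sentence) token and then generates output words one step at a time.
Each generated word becomes the input of the next step, and generation continues until <EOS> (end-of-sentence) is produced.

The process is like translating by hand: first understand the source text, then build the sentence starting from its first word, and stop once the sentence is complete.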

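III. A training trick: teacher forcing

During training we want the model to learn quickly and accurately. If the decoder feeds its own predictions into the next step, a single wrong step pushes every later step further off, and training converges slowly.

So during training we use teacher forcing:

At each step we do not feed in the model's previous prediction; instead we give it the correct word directly (taken from the ground-truth target).
Every step then has a reference answer, so the model learns more efficiently.

At inference (actual translation) there is no reference answer, so the model can only use its own predictions. This is called autoregressive generation.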
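The difference between the two modes comes down to which token is fed into the decoder at each step. The sketch below makes that explicit without any neural network: `decoder_inputs` is a hypothetical helper, and the "predictions" are invented wrong guesses chosen only to show how an early mistake changes every later input.

```python
def decoder_inputs(target, predictions, teacher_forcing):
    """Return the token fed to the decoder at each step.

    target:      ground-truth tokens, from <SOS> to <EOS>
    predictions: the model's output at each step (hypothetical values here)
    """
    inputs = [target[0]]                       # the first input is always <SOS>
    for t in range(1, len(target) - 1):
        # Next input: the true token (teacher forcing) or the model's own guess.
        inputs.append(target[t] if teacher_forcing else predictions[t - 1])
    return inputs

target = ["<SOS>", "au", "revoir", "<EOS>"]
predictions = ["à", "demain", "<EOS>"]         # hypothetical wrong guesses

forced = decoder_inputs(target, predictions, teacher_forcing=True)
free_running = decoder_inputs(target, predictions, teacher_forcing=False)

print(forced)        # the decoder always sees the reference words
print(free_running)  # the decoder sees its own (possibly wrong) words
```

With teacher forcing the second and third inputs stay on the reference sentence even though the guesses were wrong; in free-running mode the wrong first guess becomes the second input.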

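IV. The limitation of Seq2Seq: the information bottleneck

Seq2Seq is powerful, but it has one major problem: the encoder must compress the information of the entire sentence into one fixed-length vector. If the sentence is long, this vector may not hold all the details, and important information from the beginning gets lost. This problem is called the information bottleneck.

Researchers later introduced the attention mechanism, which lets the decoder look back at different parts of the input sequence when generating each word and greatly alleviates this problem.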

seq2seq_translation.py

"""
seq2seq_translation.py
A minimal English-to-French Seq2Seq translation model (LSTM-based).
Trained with teacher forcing; inference uses greedy decoding.
"""

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# ==================== 1. Set random seeds ====================
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)

# ==================== 2. Prepare data ====================
pairs = [
    ("hello", "bonjour"),
    ("goodbye", "au revoir"),
    ("thank you", "merci"),
    ("how are you", "comment allez vous"),
    ("i love you", "je t aime"),
    ("what is your name", "comment vous appelez vous"),
    ("my name is", "je m appelle"),
    ("nice to meet you", "ravi de vous rencontrer"),
    ("see you tomorrow", "à demain"),
    ("good morning", "bonjour")
]

# Build the vocabulary
def build_vocab(sentences):
    word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    idx2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
    for sent in sentences:
        for word in sent.split():
            if word not in word2idx:
                idx = len(word2idx)
                word2idx[word] = idx
                idx2word[idx] = word
    return word2idx, idx2word

# Collect all English and French sentences
eng_sentences = [pair[0] for pair in pairs]
fra_sentences = [pair[1] for pair in pairs]

eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
fra_word2idx, fra_idx2word = build_vocab(fra_sentences)

print(f"英文词表大小: {len(eng_word2idx)}")
print(f"法文词表大小: {len(fra_word2idx)}")

# Convert a sentence to an ID sequence (adding <SOS> and <EOS>)
def sentence_to_ids(sentence, word2idx, max_len=None):
    words = sentence.split()
    ids = [word2idx.get(w, word2idx["<UNK>"]) for w in words]
    ids = [word2idx["<SOS>"]] + ids + [word2idx["<EOS>"]]
    if max_len is not None:
        if len(ids) > max_len:
            ids = ids[:max_len-1] + [word2idx["<EOS>"]]
        else:
            ids += [word2idx["<PAD>"]] * (max_len - len(ids))
    return ids

# To give every batch a uniform length, fix the maximum length first (longest sentence + 2)
max_eng_len = max(len(s.split()) for s in eng_sentences) + 2   # +2 for SOS/EOS
max_fra_len = max(len(s.split()) for s in fra_sentences) + 2

# Build the dataset
class TranslationDataset(Dataset):
    def __init__(self, pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len):
        self.pairs = pairs
        self.eng_word2idx = eng_word2idx
        self.fra_word2idx = fra_word2idx
        self.max_eng_len = max_eng_len
        self.max_fra_len = max_fra_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eng, fra = self.pairs[idx]
        eng_ids = sentence_to_ids(eng, self.eng_word2idx, self.max_eng_len)
        fra_ids = sentence_to_ids(fra, self.fra_word2idx, self.max_fra_len)
        return {
            "eng": torch.tensor(eng_ids, dtype=torch.long),
            "fra": torch.tensor(fra_ids, dtype=torch.long)
        }

def collate_fn(batch):
    eng_batch = torch.stack([item["eng"] for item in batch])
    fra_batch = torch.stack([item["fra"] for item in batch])
    return {"src": eng_batch, "trg": fra_batch}

dataset = TranslationDataset(pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)

# ==================== 3. Define the model (code provided by the textbook) ====================
class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)

    def forward(self, x):
        embedded = self.embedding(x)
        _, (hidden, cell) = self.rnn(embedded)
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell):
        # x shape: (batch_size)
        x = x.unsqueeze(1)                     # (batch_size, 1)
        embedded = self.embedding(x)            # (batch_size, 1, hidden_size)
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        prediction = self.fc(output.squeeze(1)) # (batch_size, vocab_size)
        return prediction, hidden, cell

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.fc.out_features

        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)
        hidden, cell = self.encoder(src)

        # The first input is <SOS>
        input = trg[:, 0]

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[:, t, :] = output

            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1

        return outputs

# ==================== 4. Training setup ====================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"使用设备: {device}")

# Hyperparameters
hidden_size = 128
num_layers = 2
learning_rate = 0.001
epochs = 50

# Initialize the model
encoder = Encoder(len(eng_word2idx), hidden_size, num_layers)
decoder = Decoder(len(fra_word2idx), hidden_size, num_layers)
model = Seq2Seq(encoder, decoder, device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)   # ignore <PAD>
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# ==================== 5. Training loop ====================
for epoch in range(1, epochs+1):
    model.train()
    total_loss = 0
    for batch in dataloader:
        src = batch["src"].to(device)
        trg = batch["trg"].to(device)

        optimizer.zero_grad()
        output = model(src, trg, teacher_forcing_ratio=0.5)

        # Compute the loss while skipping time step 0: the decoding loop starts at t=1,
        # so output[:, 0, :] is still all zeros. Compare output[:, 1:, :] against trg[:, 1:].
        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)
    if epoch % 10 == 0:
        print(f"Epoch {epoch:2d}/{epochs}, Loss: {avg_loss:.4f}")

# ==================== 6. Inference (greedy decoding) ====================
def translate(model, sentence, eng_word2idx, fra_idx2word, max_len=20):
    model.eval()
    # Convert the input sentence to IDs and add <SOS> and <EOS>
    tokens = sentence.split()
    ids = [eng_word2idx.get(w, eng_word2idx["<UNK>"]) for w in tokens]
    ids = [eng_word2idx["<SOS>"]] + ids + [eng_word2idx["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        hidden, cell = model.encoder(src_tensor)
        trg_indexes = [fra_word2idx["<SOS>"]]
        for _ in range(max_len):
            trg_tensor = torch.tensor([trg_indexes[-1]], dtype=torch.long).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell)
            pred_token = output.argmax(1).item()
            trg_indexes.append(pred_token)
            if pred_token == fra_word2idx["<EOS>"]:
                break
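    # Drop <SOS>, and drop <EOS> only if it was actually generated
    # (if max_len is reached first, the last token is a real word and must be kept)
    if trg_indexes[-1] == fra_word2idx["<EOS>"]:
        trg_indexes = trg_indexes[:-1]
    translated = [fra_idx2word[idx] for idx in trg_indexes[1:]]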
    return " ".join(translated)

# Test a few sentences
test_sentences = ["hello", "goodbye", "thank you", "how are you", "i love you"]
for sent in test_sentences:
    translation = translate(model, sent, eng_word2idx, fra_idx2word)
    print(f"English: {sent:20} -> French: {translation}")

Output:

(base) PS E:\Datawhale 2026\base-llm202602> & D:/Users/app/miniconda3/envs/base-llm/python.exe "e:/Datawhale 2026/base-llm202602/04_1_seq2seq_translation.py"

英文词表大小: 24

法文词表大小: 22

使用设备: cuda

Epoch 10/50, Loss: 1.8116

Epoch 20/50, Loss: 0.5740

Epoch 30/50, Loss: 0.2137

Epoch 40/50, Loss: 0.1066

Epoch 50/50, Loss: 0.0588

English: hello -> French: bonjour

English: goodbye -> French: au

English: thank you -> French: merci

English: how are you -> French: comment vous vous

English: i love you -> French: je t aime

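📊 Interpreting the output

English	French (model output)	Reference	Analysis
hello	bonjour	bonjour	✅ Correct
goodbye	au	au revoir	❌ Only the first half is translated; "revoir" is missing
thank you	merci	merci	✅ Correct
how are you	comment vous vous	comment allez vous	❌ Wrong: "vous" is repeated and "allez" is missing
i love you	je t aime	je t aime	✅ Correct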

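Observations

Three of the five phrases (hello, thank you, i love you) are translated correctly.
The other two (goodbye, how are you) come out truncated or with a repeated word.
The training loss falls steadily (from about 1.8 at epoch 10 to 0.0588 at epoch 50), which suggests the model has largely "memorized" the data under training conditions. All five test phrases are themselves training examples, so the two errors show that a low training loss does not guarantee correct output under free-running greedy decoding.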

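🔍 Why does this happen?

1. The dataset is too small

We have only 10 sentence pairs. The model is effectively "reciting" these sentences rather than learning translation rules. For simple sentences seen in training it outputs the right answer; but for phrases whose target combines several words (such as "goodbye", which maps to the two words "au revoir"), it may have memorized only the first word, or it may stop too early during generation.

2. The limits of greedy decoding

At inference we use greedy decoding, picking the highest-probability word at every step. If the predicted distribution at some step is not sharply peaked, the model may pick the wrong word, and that choice affects every later step. For "goodbye", for example, the first step correctly produced "au", but at the second step the model may have rated "<EOS>" above "revoir" (the training target is "au revoir <EOS>", but the model had not learned firmly enough that "revoir" must come second), so generation stopped early.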
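A toy example makes the weakness of greedy decoding concrete. The probability table `NEXT` below is entirely hypothetical (it is not taken from the trained model); it is built so that the locally best first token leads to a worse complete sequence. Beam search, a common alternative that keeps several candidate prefixes alive, is included for comparison.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far.
NEXT = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 0.9, "w": 0.1},
}

def greedy(steps=2):
    """Pick the single most probable token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(NEXT[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, math.exp(logp)

def beam_search(beam_size=2, steps=2):
    """Keep the beam_size best prefixes (by cumulative log-probability) at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), logp + math.log(p))
            for seq, logp in beams
            for token, p in NEXT[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)

greedy_seq, greedy_p = greedy()
beam_seq, beam_p = beam_search()
print(greedy_seq, round(greedy_p, 2))   # greedy commits to "A" at step 1
print(beam_seq, round(beam_p, 2))       # the beam also explores "B"
```

Greedy decoding commits to "A" (0.6) and ends with a sequence of probability 0.6 × 0.55 = 0.33, while the beam keeps "B" alive and finds the sequence with probability 0.4 × 0.9 = 0.36.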

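3. The gap between training and inference (exposure bias)

Training used teacher forcing (at each step the true word is fed with probability 0.5, and the model's own prediction otherwise), which helps the model converge faster. At inference the model relies entirely on its own predictions; it is less used to consuming its own output, so errors can accumulate.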

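Section 2: Attention Mechanism

I. Why do we need attention?

Suppose you want to translate a Chinese sentence into English, for example "两个黄鹂鸣翠柳" ("two orioles sing in the green willows"). When translating, you do not memorize the whole sentence and then blurt it all out at once. Instead you work like this:

When saying "two", you look at "两个" in the source;
When saying "orioles", you look at "黄鹂";
When saying "sing", you look at "鸣";
When saying "green willows", you look at "翠柳".

In other words, when generating each word you pay particular attention to the part of the source that is most relevant to it. That is "attention"!

A plain Seq2Seq model cannot do this. It compresses the whole sentence into one vector and relies on that same vector for every generated word, so it has no way to focus on "两个" when producing "two" or on "黄鹂" when producing "orioles". As a result, important information from the beginning of the sentence is forgotten and translation quality drops.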

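II. How the attention mechanism works

The attention mechanism is like a translator who "looks back". When generating each word, it looks back over the whole input sentence, assigns each word an "importance" (an attention weight), and takes the weighted average of all words to obtain a "context vector" specific to the current step. The process has three steps:

Compute similarity: compare the current decoder state (what is being asked) with each encoder state (the information for each word of the input sentence) to get a score. The higher the score, the more relevant that word is to the current need.
Compute weights: apply softmax to turn the scores into weights that sum to 1; a larger weight means more important.
Weighted sum: use the weights to take a weighted average of all encoder states, giving a context vector specific to the current step.

Finally, the context vector is fed to the decoder together with the current input word, so the decoder can generate the next word better.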
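The three steps can be written out directly in plain Python. This is a minimal sketch under one assumption: it scores similarity with a dot product, which is only one possible choice (the `AttentionParams` module in the listing below uses a learned scoring function instead). The vectors are made-up 2-dimensional states.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    # Step 1: similarity between the decoder state and each encoder state (dot product).
    scores = [sum(q * k for q, k in zip(decoder_state, h)) for h in encoder_states]
    # Step 2: softmax turns the scores into attention weights.
    weights = softmax(scores)
    # Step 3: weighted average of the encoder states gives the context vector.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one made-up vector per source word
decoder_state = [2.0, 0.0]                               # made-up current decoder state

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The first and third source positions score equally against this decoder state, so they receive equal weight, and the second position contributes little to the context vector.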

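III. The mathematical abstraction of attention: the QKV paradigm

Attention can also be understood through a more general framework: Query, Key, and Value.

Query (Q): the current decoder state, meaning "what information do I need right now".
Key (K): the information for each word of the input sequence, acting as "here is some information; match it against the query".
Value (V): the actual content of each word of the input sequence (often the same as the key).

The computation is: score Q against every K for similarity, apply softmax to get weights, then use the weights to take a weighted average of V to produce the output.

A classic formula is scaled dot-product attention: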

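$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Here $d_k$ is the dimension of the keys. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large, which would push softmax into a saturated region where its gradients become vanishingly small.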
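The effect of the scaling factor is easy to see numerically. The sketch below implements the formula with nested lists (no tensor library) and compares the attention weights with and without the division by the square root of the key dimension. The helper names and the 64-dimensional vectors are illustrative assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, K, scale):
    """softmax(q·k / sqrt(d_k)) over all keys; scale=False skips the division."""
    d_k = len(q)
    divisor = math.sqrt(d_k) if scale else 1.0
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / divisor for k in K]
    return softmax(scores)

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as nested lists."""
    output = []
    for q in Q:
        w = attention_weights(q, K, scale=True)
        output.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return output

d_k = 64
q = [1.0] * d_k
K = [[1.0] * d_k, [0.5] * d_k]      # raw dot products: 64 and 32
V = [[1.0, 0.0], [0.0, 1.0]]

unscaled = attention_weights(q, K, scale=False)
scaled = attention_weights(q, K, scale=True)   # scores become 8 and 4
output = scaled_dot_product_attention([q], K, V)

print(unscaled)   # almost exactly one-hot: the second key is effectively ignored
print(scaled)     # still peaked, but the second key keeps a visible weight
```

Without scaling, the score gap of 32 makes the softmax essentially one-hot, which is the saturated regime where gradients are tiny; after scaling the gap is 4 and both keys still receive usable weight.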

attention_seq2seq.py

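"""
attention_seq2seq.py
English-to-French translation model with attention (LSTM-based).
This listing defines one attention module, the parameterized AttentionParams.
"""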

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# ==================== 1. Set random seeds ====================
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)

# ==================== 2. Prepare data ====================
pairs = [
    ("hello", "bonjour"),
    ("goodbye", "au revoir"),
    ("thank you", "merci"),
    ("how are you", "comment allez vous"),
    ("i love you", "je t aime"),
    ("what is your name", "comment vous appelez vous"),
    ("my name is", "je m appelle"),
    ("nice to meet you", "ravi de vous rencontrer"),
    ("see you tomorrow", "à demain"),
    ("good morning", "bonjour")
]

def build_vocab(sentences):
    word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    idx2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
    for sent in sentences:
        for word in sent.split():
            if word not in word2idx:
                idx = len(word2idx)
                word2idx[word] = idx
                idx2word[idx] = word
    return word2idx, idx2word

eng_sentences = [pair[0] for pair in pairs]
fra_sentences = [pair[1] for pair in pairs]
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
fra_word2idx, fra_idx2word = build_vocab(fra_sentences)

print(f"英文词表大小: {len(eng_word2idx)}")
print(f"法文词表大小: {len(fra_word2idx)}")

def sentence_to_ids(sentence, word2idx, max_len=None):
    words = sentence.split()
    ids = [word2idx.get(w, word2idx["<UNK>"]) for w in words]
    ids = [word2idx["<SOS>"]] + ids + [word2idx["<EOS>"]]
    if max_len is not None:
        if len(ids) > max_len:
            ids = ids[:max_len-1] + [word2idx["<EOS>"]]
        else:
            ids += [word2idx["<PAD>"]] * (max_len - len(ids))
    return ids

max_eng_len = max(len(s.split()) for s in eng_sentences) + 2
max_fra_len = max(len(s.split()) for s in fra_sentences) + 2

class TranslationDataset(Dataset):
    def __init__(self, pairs, eng_w2i, fra_w2i, max_eng_len, max_fra_len):
        self.pairs = pairs
        self.eng_w2i = eng_w2i
        self.fra_w2i = fra_w2i
        self.max_eng_len = max_eng_len
        self.max_fra_len = max_fra_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eng, fra = self.pairs[idx]
        eng_ids = sentence_to_ids(eng, self.eng_w2i, self.max_eng_len)
        fra_ids = sentence_to_ids(fra, self.fra_w2i, self.max_fra_len)
        return {
            "eng": torch.tensor(eng_ids, dtype=torch.long),
            "fra": torch.tensor(fra_ids, dtype=torch.long)
        }

def collate_fn(batch):
    eng_batch = torch.stack([item["eng"] for item in batch])
    fra_batch = torch.stack([item["fra"] for item in batch])
    return {"src": eng_batch, "trg": fra_batch}

dataset = TranslationDataset(pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)

# ==================== 3. Define the model components ====================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"使用设备: {device}")

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.rnn(embedded)
        outputs = torch.tanh(self.fc(outputs))
        return outputs, hidden, cell

class AttentionParams(nn.Module):
    """Parameterized attention (learnable alignment)."""
    def __init__(self, hidden_size):
        super(AttentionParams, self).__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Parameter(torch.randn(hidden_size))

    def forward(self, hidden, encoder_outputs):
        batch_size, src_len = encoder_outputs.shape[:2]
        hidden_last = hidden[-1].unsqueeze(1).repeat(1, src_len, 1)
        combined = torch.cat((hidden_last, encoder_outputs), dim=2)
        energy = torch.tanh(self.attn(combined))
        scores = torch.einsum("bsh,h->bs", energy, self.v)
        attn_weights = torch.softmax(scores, dim=1)
        return attn_weights

class DecoderWithAttention(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, attention_module):
        super(DecoderWithAttention, self).__init__()
        self.attention = attention_module
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size * 2, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell, encoder_outputs):
        embedded = self.embedding(x.unsqueeze(1))
        attn_weights = self.attention(hidden, encoder_outputs)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)
        rnn_input = torch.cat((embedded, context), dim=2)
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        predictions = self.fc(output.squeeze(1))
        return predictions, hidden, cell

class Seq2SeqAttention(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2SeqAttention, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size, trg_len = trg.shape
        trg_vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden, cell = self.encoder(src)

        # Merge the forward and backward states of each layer by summing them
        hidden = hidden.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)
        cell = cell.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)

        input = trg[:, 0]  # <SOS>

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t, :] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1

        return outputs

# ==================== 4. Training setup ====================
hidden_size = 128
num_layers = 2
learning_rate = 0.001
epochs = 50

attention_module = AttentionParams(hidden_size)
encoder = Encoder(len(eng_word2idx), hidden_size, num_layers)
decoder = DecoderWithAttention(len(fra_word2idx), hidden_size, num_layers, attention_module)
model = Seq2SeqAttention(encoder, decoder, device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# ==================== 5. Training loop ====================
print("开始训练...")
for epoch in range(1, epochs+1):
    model.train()
    total_loss = 0
    for batch in dataloader:
        src = batch["src"].to(device)
        trg = batch["trg"].to(device)

        optimizer.zero_grad()
        output = model(src, trg, teacher_forcing_ratio=0.5)

        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)
    if epoch % 10 == 0:
        print(f"Epoch {epoch:2d}/{epochs}, Loss: {avg_loss:.4f}")

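# ==================== 6. Inference (corrected version) ====================
def translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20):
    """Greedy-decoding translation function with attention.
    Args:
        model: the trained model
        sentence: English sentence
        eng_w2i: English word-to-index mapping
        fra_w2i: French word-to-index mapping
        fra_i2w: French index-to-word mapping
        max_len: maximum generation length
    """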
    model.eval()
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        # Merge the bidirectional states
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        trg_indices = [fra_w2i["<SOS>"]]
        for _ in range(max_len):
            trg_tensor = torch.tensor([trg_indices[-1]], dtype=torch.long).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell, encoder_outputs)
            pred = output.argmax(1).item()
            trg_indices.append(pred)
            if pred == fra_w2i["<EOS>"]:
                break
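    # Drop <SOS>, and drop <EOS> only if it was actually generated
    if trg_indices[-1] == fra_w2i["<EOS>"]:
        trg_indices = trg_indices[:-1]
    translated = [fra_i2w[idx] for idx in trg_indices[1:]]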
    return " ".join(translated)

# Test
test_sentences = ["hello", "goodbye", "thank you", "how are you", "i love you"]
for sent in test_sentences:
    translation = translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word)
    print(f"English: {sent:20} -> French: {translation}")

Output:

(base) PS E:\Datawhale 2026\base-llm202602> & D:/Users/app/miniconda3/envs/base-llm/python.exe "e:/Datawhale 2026/base-llm202602/04_2_attention_seq2seq.py"

英文词表大小: 24

法文词表大小: 22

使用设备: cuda

开始训练...

Epoch 10/50, Loss: 1.5959

Epoch 20/50, Loss: 0.4359

Epoch 30/50, Loss: 0.1710

Epoch 40/50, Loss: 0.1131

Epoch 50/50, Loss: 0.0592

English: hello -> French: comment vous

English: goodbye -> French: comment vous

English: thank you -> French: merci vous

English: how are you -> French: comment allez vous

English: i love you -> French: je t aime

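Analysis of the output:

The code runs without errors, but only two of the five translations are correct ("how are you" and "i love you"). "hello" in particular comes out as "comment vous" (it should be "bonjour"). The low training loss (0.0592) shows that the model fits the teacher-forced training objective, yet all five test phrases are training examples, so the errors are not a failure to generalize to unseen input: the model fails to reproduce its own training data under free-running greedy decoding. Below I analyze the causes and give an improved version.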

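I. Problem analysis

1. The dataset is too small (root cause)

We have only 10 sentence pairs, so the model is effectively "reciting" them. Although the training loss gets very low, the model has not learned translation rules; it has only memorized patterns in the training set. For some training phrases (such as "i love you") it outputs the right answer, but for others it does badly ("hello" is also in the training set, yet the model may be confusing it with the beginning of "how are you").

2. Ambiguity in the training data

Looking at the training data:

"hello" maps to "bonjour"
"good morning" also maps to "bonjour"
"how are you" maps to "comment allez vous"
"what is your name" maps to "comment vous appelez vous", which also begins with "comment"

The model may have picked up a spurious pattern: two of the ten French targets begin with "comment", so it may tend to start other sentences with "comment" as well.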
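A quick way to check this hypothesis is to count how often each word opens a French target in the training pairs. The list below copies the ten targets from the listing; `first_words` is a name introduced here for illustration.

```python
from collections import Counter

# The ten French targets from the training pairs in this note.
fra_targets = [
    "bonjour", "au revoir", "merci", "comment allez vous", "je t aime",
    "comment vous appelez vous", "je m appelle", "ravi de vous rencontrer",
    "à demain", "bonjour",
]

# Count the first word of every target sentence.
first_words = Counter(sentence.split()[0] for sentence in fra_targets)
for word, count in first_words.most_common():
    print(f"{word:10} {count}")
```

In this data "comment", "je", and "bonjour" each open two targets, so the count alone does not single out "comment"; the data imbalance may contribute, but it is unlikely to be the whole explanation.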

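3. The decoder's starting state at inference may be a problem

Teacher forcing helped the model learn the correct word sequences during training, but at inference the model generates on its own: once the first step is wrong, every later step drifts further off. In the output, three of the five translations begin with "comment" ("hello" and "goodbye" wrongly, "how are you" correctly), which suggests the decoder's first-step choice is unreliable.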
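One concrete difference between training and inference is visible in the listings and may be worth ruling out (this note did not test it). During training the source IDs are padded to `max_eng_len` and the encoder reads the `<PAD>` tokens too, while `translate` builds the source IDs without padding, so the encoder state handed to the decoder is computed from a different input. The sketch below reproduces `sentence_to_ids` with a one-word vocabulary (the reduced vocabulary is an assumption for brevity) to show the two inputs for "hello".

```python
# Special-token IDs as in build_vocab; "hello" is the first real word, so it gets ID 4.
eng_word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "hello": 4}

def sentence_to_ids(sentence, word2idx, max_len=None):
    """Same logic as the listing: add <SOS>/<EOS>, then truncate or pad to max_len."""
    ids = [word2idx.get(w, word2idx["<UNK>"]) for w in sentence.split()]
    ids = [word2idx["<SOS>"]] + ids + [word2idx["<EOS>"]]
    if max_len is not None:
        if len(ids) > max_len:
            ids = ids[:max_len - 1] + [word2idx["<EOS>"]]
        else:
            ids += [word2idx["<PAD>"]] * (max_len - len(ids))
    return ids

max_eng_len = 6   # longest English sentence has 4 words, plus <SOS> and <EOS>

train_ids = sentence_to_ids("hello", eng_word2idx, max_eng_len)  # as in TranslationDataset
infer_ids = sentence_to_ids("hello", eng_word2idx)               # as built inside translate()

print(train_ids)   # padded: the encoder also steps through three <PAD> tokens
print(infer_ids)   # unpadded: the encoder stops right after <EOS>
```

If this mismatch matters, passing `max_eng_len` when building the source IDs at inference would make the two inputs identical; whether that changes the translations shown above has not been verified here.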

attention_seq2seq_improved.py

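"""
attention_seq2seq_improved.py
Improved English-to-French translation model with attention
- adds beam search decoding
- lowers the learning rate and trains for more epochs
"""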

import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# ==================== 1. Set random seeds ====================
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)

# ==================== 2. Prepare data (simple augmentation) ====================
# Original data
pairs_original = [
    ("hello", "bonjour"),
    ("goodbye", "au revoir"),
    ("thank you", "merci"),
    ("how are you", "comment allez vous"),
    ("i love you", "je t aime"),
    ("what is your name", "comment vous appelez vous"),
    ("my name is", "je m appelle"),
    ("nice to meet you", "ravi de vous rencontrer"),
    ("see you tomorrow", "à demain"),
    ("good morning", "bonjour")
]

# Repeat the original data 3 times to enlarge the dataset (a stopgap that adds no new examples; a real setup should use a larger dataset)
pairs = pairs_original * 3
print(f"扩充后句子对数量: {len(pairs)}")

# Build the vocabulary
def build_vocab(sentences):
    word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    idx2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
    for sent in sentences:
        for word in sent.split():
            if word not in word2idx:
                idx = len(word2idx)
                word2idx[word] = idx
                idx2word[idx] = word
    return word2idx, idx2word

eng_sentences = [pair[0] for pair in pairs]
fra_sentences = [pair[1] for pair in pairs]
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
fra_word2idx, fra_idx2word = build_vocab(fra_sentences)

print(f"英文词表大小: {len(eng_word2idx)}")
print(f"法文词表大小: {len(fra_word2idx)}")

def sentence_to_ids(sentence, word2idx, max_len=None):
    words = sentence.split()
    ids = [word2idx.get(w, word2idx["<UNK>"]) for w in words]
    ids = [word2idx["<SOS>"]] + ids + [word2idx["<EOS>"]]
    if max_len is not None:
        if len(ids) > max_len:
            ids = ids[:max_len-1] + [word2idx["<EOS>"]]
        else:
            ids += [word2idx["<PAD>"]] * (max_len - len(ids))
    return ids

max_eng_len = max(len(s.split()) for s in eng_sentences) + 2
max_fra_len = max(len(s.split()) for s in fra_sentences) + 2

class TranslationDataset(Dataset):
    def __init__(self, pairs, eng_w2i, fra_w2i, max_eng_len, max_fra_len):
        self.pairs = pairs
        self.eng_w2i = eng_w2i
        self.fra_w2i = fra_w2i
        self.max_eng_len = max_eng_len
        self.max_fra_len = max_fra_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eng, fra = self.pairs[idx]
        eng_ids = sentence_to_ids(eng, self.eng_w2i, self.max_eng_len)
        fra_ids = sentence_to_ids(fra, self.fra_w2i, self.max_fra_len)
        return {
            "eng": torch.tensor(eng_ids, dtype=torch.long),
            "fra": torch.tensor(fra_ids, dtype=torch.long)
        }

def collate_fn(batch):
    eng_batch = torch.stack([item["eng"] for item in batch])
    fra_batch = torch.stack([item["fra"] for item in batch])
    return {"src": eng_batch, "trg": fra_batch}

dataset = TranslationDataset(pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)

# ==================== 3. Define the model components ====================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"使用设备: {device}")

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.rnn(embedded)
        outputs = torch.tanh(self.fc(outputs))
        return outputs, hidden, cell

class AttentionParams(nn.Module):
    def __init__(self, hidden_size):
        super(AttentionParams, self).__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Parameter(torch.randn(hidden_size))

    def forward(self, hidden, encoder_outputs):
        batch_size, src_len = encoder_outputs.shape[:2]
        hidden_last = hidden[-1].unsqueeze(1).repeat(1, src_len, 1)
        combined = torch.cat((hidden_last, encoder_outputs), dim=2)
        energy = torch.tanh(self.attn(combined))
        scores = torch.einsum("bsh,h->bs", energy, self.v)
        attn_weights = torch.softmax(scores, dim=1)
        return attn_weights

class DecoderWithAttention(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, attention_module):
        super(DecoderWithAttention, self).__init__()
        self.attention = attention_module
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size * 2, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell, encoder_outputs):
        embedded = self.embedding(x.unsqueeze(1))
        attn_weights = self.attention(hidden, encoder_outputs)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)
        rnn_input = torch.cat((embedded, context), dim=2)
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        predictions = self.fc(output.squeeze(1))
        return predictions, hidden, cell

class Seq2SeqAttention(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2SeqAttention, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size, trg_len = trg.shape
        trg_vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden, cell = self.encoder(src)

        # Merge the forward and backward states of each layer by summing them
        hidden = hidden.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)
        cell = cell.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)

        input = trg[:, 0]

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t, :] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1

        return outputs

# ==================== 4. Training setup ====================
hidden_size = 256      # larger hidden size
num_layers = 3         # more layers
learning_rate = 0.0005 # lower learning rate
epochs = 100

attention_module = AttentionParams(hidden_size)
encoder = Encoder(len(eng_word2idx), hidden_size, num_layers)
decoder = DecoderWithAttention(len(fra_word2idx), hidden_size, num_layers, attention_module)
model = Seq2SeqAttention(encoder, decoder, device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# ==================== 5. Training loop ====================
print("开始训练...")
for epoch in range(1, epochs+1):
    model.train()
    total_loss = 0
    for batch in dataloader:
        src = batch["src"].to(device)
        trg = batch["trg"].to(device)

        optimizer.zero_grad()
        output = model(src, trg, teacher_forcing_ratio=0.5)

        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(dataloader)
    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d}/{epochs}, Loss: {avg_loss:.4f}")

# ==================== 6. Beam search inference ====================
def beam_search_translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20, beam_size=3):
    """
    Beam search decoding.
    beam_size: number of candidate paths to keep
    """
    model.eval()
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        # Merge the bidirectional states
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        # Initialize the beam: each candidate holds (sequence, log-probability, hidden, cell)
        sequences = [[[fra_w2i["<SOS>"]], 0.0, hidden, cell]]

        for _ in range(max_len):
            all_candidates = []
            for seq, score, h, c in sequences:
                if seq[-1] == fra_w2i["<EOS>"]:
                    # Keep already-finished sequences as they are
                    all_candidates.append((seq, score, h, c))
                    continue
                trg_tensor = torch.tensor([seq[-1]], dtype=torch.long).to(device)
                output, h_new, c_new = model.decoder(trg_tensor, h, c, encoder_outputs)
                log_probs = torch.log_softmax(output, dim=1)
                topk_probs, topk_ids = log_probs.topk(beam_size)
                for i in range(beam_size):
                    new_seq = seq + [topk_ids[0, i].item()]
                    new_score = score + topk_probs[0, i].item()
                    all_candidates.append((new_seq, new_score, h_new, c_new))
            # Sort by cumulative log-probability and keep the best beam_size candidates
            sequences = sorted(all_candidates, key=lambda x: x[1], reverse=True)[:beam_size]
            # Stop early once every candidate has finished
            if all(seq[-1] == fra_w2i["<EOS>"] for seq, _, _, _ in sequences):
                break

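        # Pick the highest-scoring sequence
        best_seq = sequences[0][0]

    # Drop <SOS> and <EOS>; filtering (instead of slicing [1:-1]) keeps the last word when max_len is hit before <EOS>
    translated = [fra_i2w[idx] for idx in best_seq[1:] if idx != fra_w2i["<EOS>"]]
    return " ".join(translated)

# Greedy decoding (for comparison)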
def greedy_translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20):
    model.eval()
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        trg_indices = [fra_w2i["<SOS>"]]
        for _ in range(max_len):
            trg_tensor = torch.tensor([trg_indices[-1]], dtype=torch.long).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell, encoder_outputs)
            pred = output.argmax(1).item()
            trg_indices.append(pred)
            if pred == fra_w2i["<EOS>"]:
                break
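    # Drop <SOS> and <EOS>; filtering keeps the last word when max_len is hit before <EOS>
    translated = [fra_i2w[idx] for idx in trg_indices[1:] if idx != fra_w2i["<EOS>"]]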
    return " ".join(translated)

# Test
test_sentences = ["hello", "goodbye", "thank you", "how are you", "i love you"]
print("\n" + "="*50)
print("Greedy decoding results:")
for sent in test_sentences:
    translation = greedy_translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word)
    print(f"English: {sent:20} -> French: {translation}")

print("\n" + "="*50)
print("Beam search decoding results (beam_size=3):")
for sent in test_sentences:
    translation = beam_search_translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word, beam_size=3)
    print(f"English: {sent:20} -> French: {translation}")

Output:

(base) PS E:\Datawhale 2026\base-llm202602> & D:/Users/app/miniconda3/envs/base-llm/python.exe "e:/Datawhale 2026/base-llm202602/04_3_attention_seq2seq_improved.py"
Sentence pairs after augmentation: 30
English vocabulary size: 24
French vocabulary size: 22
Using device: cuda
Starting training...
Epoch 10/100, Loss: 0.2668
Epoch 20/100, Loss: 0.0694
Epoch 30/100, Loss: 0.0389
Epoch 40/100, Loss: 0.0164
Epoch 50/100, Loss: 0.0100
Epoch 60/100, Loss: 0.0080
Epoch 70/100, Loss: 0.0059
Epoch 80/100, Loss: 0.0045
Epoch 90/100, Loss: 0.0038
Epoch 100/100, Loss: 0.0031

==================================================
Greedy decoding results:
English: hello -> French: comment vous vous
English: goodbye -> French: comment
English: thank you -> French: ravi vous rencontrer
English: how are you -> French: comment allez vous
English: i love you -> French: je t aime

==================================================
Beam search decoding results (beam_size=3):
English: hello -> French: comment vous vous
English: goodbye -> French: t
English: thank you -> French: ravi vous rencontrer
English: how are you -> French: comment allez vous
English: i love you -> French: je t aime

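The training loss dropped to a very low level (0.0031), but the translations are still poor, which looks like a typical sign of overfitting. The model has memorized the sentence pairs in the training set without learning how to translate: "how are you" and "i love you" come out correctly, while "hello", "goodbye" and "thank you" come out as fragments that are common in the training set. One caveat: "hello", "goodbye" and "thank you" are also in the pairs list of the first script, so if the same pairs are used here they are not unseen sentences, and a mismatch between training and inference (for example, source sentences padded to a fixed length during training but left unpadded at inference) may be part of the cause.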
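Before attributing a wrong translation to poor generalization, it can help to check whether the test sentence actually occurs in the training pairs. The sketch below is a minimal illustration: the helper name `split_seen_unseen` is hypothetical, and the five pairs are the first entries of the `pairs` list from the first script, assumed here (not confirmed) to also be part of this script's training data.

```python
def split_seen_unseen(test_sentences, train_pairs):
    """Partition test sentences by whether their source side appears in the training pairs."""
    train_sources = {src for src, _ in train_pairs}
    seen = [s for s in test_sentences if s in train_sources]
    unseen = [s for s in test_sentences if s not in train_sources]
    return seen, unseen

# Assumption: the same pairs as the first script's list (only the first five are shown).
train_pairs = [
    ("hello", "bonjour"),
    ("goodbye", "au revoir"),
    ("thank you", "merci"),
    ("how are you", "comment allez vous"),
    ("i love you", "je t aime"),
]
test_sentences = ["hello", "goodbye", "thank you", "how are you", "i love you"]

seen, unseen = split_seen_unseen(test_sentences, train_pairs)
print("seen in training:", seen)
print("unseen:", unseen)
```

An empty `unseen` list would mean the test is measuring memorization rather than generalization.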

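📊 Output analysis

English	Greedy decoding	Beam search	Correct translation	Analysis
hello	comment vous vous	comment vous vous	bonjour	Completely wrong; the model wrongly opens with "comment vous"
goodbye	comment	t	au revoir	Wrong; "comment" comes from the start of the "how are you" translation
thank you	ravi vous rencontrer	ravi vous rencontrer	merci	Wrong; this is a fragment of the translation of "nice to meet you"
how are you	comment allez vous	comment allez vous	comment allez vous	✅ Correct (appears in the training set)
i love you	je t aime	je t aime	je t aime	✅ Correct (appears in the training set)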

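🔍 Root-cause analysis

Too little data: although the data was replicated three times (30 pairs), it is still the same 10 sentences repeated. The model only memorized these sentences and did not learn word-to-word correspondences. For example, it learned that "comment" often appears at the start of a sentence, but not that "hello" should map to "bonjour".
Model capacity too large: with hidden_size=256 and num_layers=3, the model has far too many parameters for 30 samples, so it can easily memorize each sample instead of learning regularities.
Beam search cannot rescue it: beam search only looks for higher-probability paths under the model's own distribution. If that distribution is wrong (for example, the first step already favors "comment"), beam search cannot help.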
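To make the model-capacity point concrete, the sketch below counts the weights of the two LSTM stacks for hidden_size=256 and num_layers=3. It uses the per-layer layout of PyTorch's `nn.LSTM`: four gates, each with an input weight, a hidden weight and two bias vectors. The helper `lstm_param_count` is a hypothetical name, and the architecture (a bidirectional encoder, and a decoder whose input is an embedding concatenated with a context vector) is assumed to match the later scripts. Embeddings, attention and output layers are left out, so the real total is larger.

```python
def lstm_param_count(input_size, hidden_size, num_layers, bidirectional=False):
    """Count parameters of a stacked LSTM laid out like torch.nn.LSTM (with biases)."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers above the first receive the concatenated outputs of all directions.
        layer_input = input_size if layer == 0 else hidden_size * directions
        # 4 gates x (input weights + hidden weights), plus two bias vectors of size 4 * hidden_size.
        per_direction = 4 * hidden_size * (layer_input + hidden_size) + 2 * 4 * hidden_size
        total += per_direction * directions
    return total

hidden_size, num_layers = 256, 3
encoder_lstm = lstm_param_count(hidden_size, hidden_size, num_layers, bidirectional=True)
decoder_lstm = lstm_param_count(hidden_size * 2, hidden_size, num_layers)

print(f"encoder LSTM parameters: {encoder_lstm:,}")
print(f"decoder LSTM parameters: {decoder_lstm:,}")
print(f"LSTM parameters per training pair (30 pairs): {(encoder_lstm + decoder_lstm) // 30:,}")
```

Under these assumptions the two LSTM stacks alone hold about 6 million weights, against 30 training pairs built from 10 distinct sentences.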

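🚀 Suggested next steps

1. Get a real dataset (most important)

A dataset with at least several hundred sentence pairs is needed.

Anki English-French dataset: download fra-eng.zip and unzip it to get a text file in which each line starts with "English\tFrench" (followed by an attribution field). It can replace the current pairs list.

Download the dataset from https://www.manythings.org/anki/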


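"""
attention_seq2seq_final.py
Full English-French translation model with attention
- Reads data from fra.txt and takes the first 5000 pairs
- Cleans text (lowercase, normalizes special spaces, keeps punctuation)
- Splits into training and validation sets (90% training, 10% validation)
- Monitors validation loss during training and saves the best model
- Includes greedy decoding and beam search inference
"""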

import random
import re
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

# ==================== 1. Set random seeds ====================
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)

# ==================== 2. Data preparation ====================
# Read fra.txt (make sure the file is in the same directory as this script)
print("Loading data...")
with open("fra.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Take the first 5000 pairs; each line has the form: English\tFrench\tattribution
raw_pairs = [line.strip().split("\t") for line in lines[:5000]]

# Cleaning: lowercase and replace special spaces (punctuation is kept)
def clean_text(text):
    text = text.lower().strip()
    text = text.replace('\u00A0', ' ')  # replace non-breaking spaces
    return text

pairs = [(clean_text(eng), clean_text(fra)) for eng, fra, *_ in raw_pairs]
print(f"Total sentence pairs: {len(pairs)}")

# Split into training and validation sets
train_pairs, val_pairs = train_test_split(pairs, test_size=0.1, random_state=42)
print(f"Training set size: {len(train_pairs)}")
print(f"Validation set size: {len(val_pairs)}")

# Build the vocabulary
def build_vocab(sentences):
    word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    idx2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
    for sent in sentences:
        for word in sent.split():
            if word not in word2idx:
                idx = len(word2idx)
                word2idx[word] = idx
                idx2word[idx] = word
    return word2idx, idx2word

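# Build the vocabulary from all sentences (so the validation set contains no out-of-vocabulary words)
eng_sentences = [p[0] for p in pairs]
fra_sentences = [p[1] for p in pairs]
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
fra_word2idx, fra_idx2word = build_vocab(fra_sentences)

print(f"English vocabulary size: {len(eng_word2idx)}")
print(f"French vocabulary size: {len(fra_word2idx)}")

# Convert a sentence to an ID sequence (add <SOS> and <EOS>, then pad to the maximum length)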
def sentence_to_ids(sentence, word2idx, max_len=None):
    words = sentence.split()
    ids = [word2idx.get(w, word2idx["<UNK>"]) for w in words]
    ids = [word2idx["<SOS>"]] + ids + [word2idx["<EOS>"]]
    if max_len is not None:
        if len(ids) > max_len:
            ids = ids[:max_len-1] + [word2idx["<EOS>"]]
        else:
            ids += [word2idx["<PAD>"]] * (max_len - len(ids))
    return ids

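# Compute the maximum lengths (over all sentences, used for padding); +2 accounts for <SOS> and <EOS>
max_eng_len = max(len(s.split()) for s in eng_sentences) + 2
max_fra_len = max(len(s.split()) for s in fra_sentences) + 2
print(f"Max English sequence length: {max_eng_len}")
print(f"Max French sequence length: {max_fra_len}")

# Dataset class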
class TranslationDataset(Dataset):
    def __init__(self, pairs, eng_w2i, fra_w2i, max_eng_len, max_fra_len):
        self.pairs = pairs
        self.eng_w2i = eng_w2i
        self.fra_w2i = fra_w2i
        self.max_eng_len = max_eng_len
        self.max_fra_len = max_fra_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eng, fra = self.pairs[idx]
        eng_ids = sentence_to_ids(eng, self.eng_w2i, self.max_eng_len)
        fra_ids = sentence_to_ids(fra, self.fra_w2i, self.max_fra_len)
        return {
            "eng": torch.tensor(eng_ids, dtype=torch.long),
            "fra": torch.tensor(fra_ids, dtype=torch.long)
        }

# collate_fn: stack the samples of a batch
def collate_fn(batch):
    eng_batch = torch.stack([item["eng"] for item in batch])
    fra_batch = torch.stack([item["fra"] for item in batch])
    return {"src": eng_batch, "trg": fra_batch}

# Create the DataLoaders
batch_size = 64
train_dataset = TranslationDataset(train_pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)
val_dataset = TranslationDataset(val_pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

# ==================== 3. Model components ====================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.rnn(embedded)
        outputs = torch.tanh(self.fc(outputs))
        return outputs, hidden, cell

class AttentionParams(nn.Module):
    def __init__(self, hidden_size):
        super(AttentionParams, self).__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Parameter(torch.randn(hidden_size))

    def forward(self, hidden, encoder_outputs):
        batch_size, src_len = encoder_outputs.shape[:2]
        hidden_last = hidden[-1].unsqueeze(1).repeat(1, src_len, 1)
        combined = torch.cat((hidden_last, encoder_outputs), dim=2)
        energy = torch.tanh(self.attn(combined))
        scores = torch.einsum("bsh,h->bs", energy, self.v)
        attn_weights = torch.softmax(scores, dim=1)
        return attn_weights

class DecoderWithAttention(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, attention_module):
        super(DecoderWithAttention, self).__init__()
        self.attention = attention_module
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size * 2, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell, encoder_outputs):
        embedded = self.embedding(x.unsqueeze(1))
        attn_weights = self.attention(hidden, encoder_outputs)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)
        rnn_input = torch.cat((embedded, context), dim=2)
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        predictions = self.fc(output.squeeze(1))
        return predictions, hidden, cell

class Seq2SeqAttention(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2SeqAttention, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size, trg_len = trg.shape
        trg_vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden, cell = self.encoder(src)

        # Merge the bidirectional states (sum the forward and backward states of each layer)
        hidden = hidden.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)
        cell = cell.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)

        input = trg[:, 0]  # <SOS>

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t, :] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1

        return outputs

# ==================== 4. Training setup ====================
hidden_size = 256      # hidden dimension
num_layers = 3         # number of LSTM layers
learning_rate = 0.0005 # learning rate
epochs = 100           # maximum number of epochs

attention_module = AttentionParams(hidden_size)
encoder = Encoder(len(eng_word2idx), hidden_size, num_layers)
decoder = DecoderWithAttention(len(fra_word2idx), hidden_size, num_layers, attention_module)
model = Seq2SeqAttention(encoder, decoder, device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore <PAD>
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# ==================== 5. Training loop (with validation and early stopping) ====================
best_val_loss = float('inf')
patience = 5
patience_counter = 0

print("Starting training...")
for epoch in range(1, epochs+1):
    # Training phase
    model.train()
    train_loss = 0.0
    for batch in train_loader:
        src = batch["src"].to(device)
        trg = batch["trg"].to(device)

        optimizer.zero_grad()
        output = model(src, trg, teacher_forcing_ratio=0.5)

        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()

        train_loss += loss.item()

    avg_train_loss = train_loss / len(train_loader)

    # Validation phase (full teacher forcing, teacher_forcing_ratio=1)
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            src = batch["src"].to(device)
            trg = batch["trg"].to(device)
            output = model(src, trg, teacher_forcing_ratio=1)  # feed the ground-truth target as input
            output_dim = output.shape[-1]
            output = output[:, 1:, :].reshape(-1, output_dim)
            trg = trg[:, 1:].reshape(-1)
            loss = criterion(output, trg)
            val_loss += loss.item()
    avg_val_loss = val_loss / len(val_loader)

    print(f"Epoch {epoch:3d}: train loss {avg_train_loss:.4f}, val loss {avg_val_loss:.4f}")

    # Save the best model
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        torch.save(model.state_dict(), "best_model.pth")
        print(f"  -> Saved best model (val loss {best_val_loss:.4f})")
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Validation loss has not improved for {patience} epochs; stopping early.")
            break

# Load the best model for inference (map_location keeps this working on CPU-only machines)
model.load_state_dict(torch.load("best_model.pth", map_location=device))
print(f"Training finished, best validation loss: {best_val_loss:.4f}")

# ==================== 6. Inference functions ====================
def greedy_translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20):
    model.eval()
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        trg_indices = [fra_w2i["<SOS>"]]
        for _ in range(max_len):
            trg_tensor = torch.tensor([trg_indices[-1]], dtype=torch.long).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell, encoder_outputs)
            pred = output.argmax(1).item()
            trg_indices.append(pred)
            if pred == fra_w2i["<EOS>"]:
                break
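    # Drop <SOS> and <EOS>; filtering keeps the last word when max_len is hit before <EOS>
    translated = [fra_i2w[idx] for idx in trg_indices[1:] if idx != fra_w2i["<EOS>"]]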
    return " ".join(translated)

def beam_search_translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20, beam_size=3):
    model.eval()
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        sequences = [[[fra_w2i["<SOS>"]], 0.0, hidden, cell]]

        for _ in range(max_len):
            all_candidates = []
            for seq, score, h, c in sequences:
                if seq[-1] == fra_w2i["<EOS>"]:
                    all_candidates.append((seq, score, h, c))
                    continue
                trg_tensor = torch.tensor([seq[-1]], dtype=torch.long).to(device)
                output, h_new, c_new = model.decoder(trg_tensor, h, c, encoder_outputs)
                log_probs = torch.log_softmax(output, dim=1)
                topk_probs, topk_ids = log_probs.topk(beam_size)
                for i in range(beam_size):
                    new_seq = seq + [topk_ids[0, i].item()]
                    new_score = score + topk_probs[0, i].item()
                    all_candidates.append((new_seq, new_score, h_new, c_new))
            sequences = sorted(all_candidates, key=lambda x: x[1], reverse=True)[:beam_size]
            if all(seq[-1] == fra_w2i["<EOS>"] for seq, _, _, _ in sequences):
                break

        best_seq = sequences[0][0]
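    # Drop <SOS> and <EOS>; filtering keeps the last word when max_len is hit before <EOS>
    translated = [fra_i2w[idx] for idx in best_seq[1:] if idx != fra_w2i["<EOS>"]]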
    return " ".join(translated)

# ==================== 7. Test examples ====================
test_sentences = ["hello", "goodbye", "thank you", "how are you", "i love you"]
print("\n" + "="*50)
print("Greedy decoding results:")
for sent in test_sentences:
    translation = greedy_translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word)
    print(f"English: {sent:20} -> French: {translation}")

print("\n" + "="*50)
print("Beam search decoding results (beam_size=3):")
for sent in test_sentences:
    translation = beam_search_translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word, beam_size=3)
    print(f"English: {sent:20} -> French: {translation}")

Output:

(base) PS E:\Datawhale 2026\base-llm202602> & D:/Users/app/miniconda3/envs/base-llm/python.exe "e:/Datawhale 2026/base-llm202602/04_4_attention_seq2seq_final.py"
Loading data...
Total sentence pairs: 5000
Training set size: 4500
Validation set size: 500
English vocabulary size: 1477
French vocabulary size: 3105
Max English sequence length: 6
Max French sequence length: 12
Using device: cuda
Starting training...
Epoch 1: train loss 5.5504, val loss 4.9673
-> Saved best model (val loss 4.9673)
Epoch 2: train loss 4.7135, val loss 4.7811
-> Saved best model (val loss 4.7811)
Epoch 3: train loss 4.4996, val loss 4.6484
-> Saved best model (val loss 4.6484)
Epoch 4: train loss 4.3109, val loss 4.5100
-> Saved best model (val loss 4.5100)
Epoch 5: train loss 4.1283, val loss 4.3813
-> Saved best model (val loss 4.3813)
Epoch 6: train loss 3.9339, val loss 4.2743
-> Saved best model (val loss 4.2743)
Epoch 7: train loss 3.7696, val loss 4.1226
-> Saved best model (val loss 4.1226)
Epoch 8: train loss 3.5992, val loss 3.9994
-> Saved best model (val loss 3.9994)
Epoch 9: train loss 3.4472, val loss 3.8782
-> Saved best model (val loss 3.8782)
Epoch 10: train loss 3.3213, val loss 3.8716
-> Saved best model (val loss 3.8716)
Epoch 11: train loss 3.1970, val loss 3.7450
-> Saved best model (val loss 3.7450)
Epoch 12: train loss 3.0623, val loss 3.6866
-> Saved best model (val loss 3.6866)
Epoch 13: train loss 2.9440, val loss 3.6336
-> Saved best model (val loss 3.6336)
Epoch 14: train loss 2.8478, val loss 3.5979
-> Saved best model (val loss 3.5979)
Epoch 15: train loss 2.7540, val loss 3.5463
-> Saved best model (val loss 3.5463)
Epoch 16: train loss 2.6331, val loss 3.5163
-> Saved best model (val loss 3.5163)
Epoch 17: train loss 2.5357, val loss 3.4695
-> Saved best model (val loss 3.4695)
Epoch 18: train loss 2.4327, val loss 3.4429
-> Saved best model (val loss 3.4429)
Epoch 19: train loss 2.3373, val loss 3.4292
-> Saved best model (val loss 3.4292)
Epoch 20: train loss 2.2546, val loss 3.3905
-> Saved best model (val loss 3.3905)
Epoch 21: train loss 2.1472, val loss 3.3853
-> Saved best model (val loss 3.3853)
Epoch 22: train loss 2.0699, val loss 3.3549
-> Saved best model (val loss 3.3549)
Epoch 23: train loss 2.0038, val loss 3.3430
-> Saved best model (val loss 3.3430)
Epoch 24: train loss 1.9192, val loss 3.3366
-> Saved best model (val loss 3.3366)
Epoch 25: train loss 1.8470, val loss 3.3239
-> Saved best model (val loss 3.3239)
Epoch 26: train loss 1.7481, val loss 3.3068
-> Saved best model (val loss 3.3068)
Epoch 27: train loss 1.6698, val loss 3.2924
-> Saved best model (val loss 3.2924)
Epoch 28: train loss 1.5948, val loss 3.2799
-> Saved best model (val loss 3.2799)
Epoch 29: train loss 1.5541, val loss 3.2877
Epoch 30: train loss 1.4746, val loss 3.2675
-> Saved best model (val loss 3.2675)
Epoch 31: train loss 1.4178, val loss 3.2822
Epoch 32: train loss 1.3466, val loss 3.2625
-> Saved best model (val loss 3.2625)
Epoch 33: train loss 1.2826, val loss 3.2609
-> Saved best model (val loss 3.2609)
Epoch 34: train loss 1.2400, val loss 3.2561
-> Saved best model (val loss 3.2561)
Epoch 35: train loss 1.1706, val loss 3.2628
Epoch 36: train loss 1.1317, val loss 3.2654
Epoch 37: train loss 1.0765, val loss 3.2580
Epoch 38: train loss 1.0425, val loss 3.2772
Epoch 39: train loss 0.9873, val loss 3.2685
Validation loss has not improved for 5 epochs; stopping early.
Training finished, best validation loss: 3.2561

==================================================
Greedy decoding results:
English: hello -> French: suis-je ?
English: goodbye -> French: comment que ?
English: thank you -> French: tu que ?
English: how are you -> French: comment va ?
English: i love you -> French: je ai !

==================================================
Beam search decoding results (beam_size=3):
English: hello -> French: suis-je ?
English: goodbye -> French: comment que ?
English: thank you -> French: tu que ?
English: how are you -> French: comment va ?
English: i love you -> French: je ai !

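📊 Result analysis

English	Model translation	Expected translation	Analysis
hello	suis-je ?	bonjour / salut	The model wrongly picked up an interrogative pattern
goodbye	comment que ?	au revoir	Misuses the question word "comment"
thank you	tu que ?	merci	Completely wrong
how are you	comment va ?	comment allez vous	Close, but not accurate
i love you	je ai !	je t'aime	Wrong verb (should be "aime")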
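One thing worth checking alongside this table: the scripts tokenize with `str.split()` and keep punctuation, so a punctuation mark written without a preceding space stays attached to its word. If the English training sentences end that way, a test input such as "thank you" may not match the training token "you." and would fall back to `<UNK>`. The sketch below shows the effect on a tiny made-up vocabulary, together with one possible regex tokenizer that separates punctuation. The sample sentences and the helpers `tokenize` and `unk_rate` are illustrative assumptions, not taken from fra.txt or from the original scripts.

```python
import re

def tokenize(text):
    """Lowercase, then split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def unk_rate(tokens, vocab):
    """Fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)

# Made-up training sentences, for illustration only.
train_sentences = ["Hello!", "Thank you.", "How are you?"]

# Vocabulary built the way the scripts do it (whitespace split) versus with the regex tokenizer.
split_vocab = {w for s in train_sentences for w in s.lower().split()}
regex_vocab = {w for s in train_sentences for w in tokenize(s)}

query = "thank you"
split_unk = unk_rate(query.split(), split_vocab)    # "you" was only seen as "you." and "you?"
regex_unk = unk_rate(tokenize(query), regex_vocab)  # both words are in the vocabulary
print(f"UNK rate with str.split(): {split_unk:.2f}")
print(f"UNK rate with regex tokenizer: {regex_unk:.2f}")
```

A tokenizer like this also splits apostrophes and hyphens ("t'aime" becomes three tokens), so whether it suits French is a separate design choice. Running the same `unk_rate` check on the real vocabulary would show how much of the test input the model can actually see.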

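Main problems

Overfitting: the training loss (0.9873 at epoch 39) is far below the validation loss (3.2685 at the same epoch, best 3.2561), which suggests the model memorized the training set and does not generalize.
Insufficient data: 5000 sentence pairs are too few for a model with hidden_size=256 and num_layers=3; it has too many parameters and overfits easily.
Data distribution: the training set may contain many questions, which would bias the model toward producing interrogative sentences.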
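A cross-entropy loss is often easier to interpret as a perplexity, exp(loss), which can be read as the effective number of equally likely tokens the model is choosing among at each step. The sketch below applies this to the final-epoch training loss and the best validation loss from the log. The two numbers are not measured under identical conditions (training uses teacher_forcing_ratio=0.5, validation uses 1), so the comparison is only indicative.

```python
import math

def perplexity(cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)

train_loss = 0.9873  # training loss at epoch 39, from the log
val_loss = 3.2561    # best validation loss, from the log

train_ppl = perplexity(train_loss)
val_ppl = perplexity(val_loss)
print(f"train perplexity: {train_ppl:.1f}")
print(f"val perplexity:   {val_ppl:.1f}")
```

Read this way, the model is choosing among fewer than 3 effective candidates per token on training data but about 26 on validation data, which matches the overfitting diagnosis.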


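"""
final_attention_translation.py
Full English-French translation model with attention (based on the Anki dataset)
- Uses more data (first 20000 pairs by default)
- Adjusted hyperparameters: hidden_size=256, num_layers=3, dropout=0.3, lr=0.0002, batch_size=128
- Adds dropout regularization to the LSTMs
- Train/validation split, early stopping, saves the best model
- Supports greedy decoding and beam search inference
"""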

import random
import re
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

# ==================== 1. Set random seeds ====================
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)

# ==================== 2. Data preparation ====================
print("Loading data...")
with open("fra.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# Choose how much data to use (adjust to your GPU memory and time budget, e.g. 20000, 50000, or all of it)
# The first 20000 pairs are used here as an example; change the number, or drop the slice to use everything
num_pairs = 20000  # change as needed
raw_pairs = [line.strip().split("\t") for line in lines[:num_pairs]]
print(f"Sentence pairs used: {len(raw_pairs)}")

# Cleaning: lowercase and replace special spaces (punctuation is kept)
def clean_text(text):
    text = text.lower().strip()
    text = text.replace('\u00A0', ' ')  # replace non-breaking spaces
    return text

pairs = [(clean_text(eng), clean_text(fra)) for eng, fra, *_ in raw_pairs]
print(f"Valid sentence pairs after cleaning: {len(pairs)}")

# Split into training and validation sets
train_pairs, val_pairs = train_test_split(pairs, test_size=0.1, random_state=42)
print(f"Training set size: {len(train_pairs)}")
print(f"Validation set size: {len(val_pairs)}")

# Build the vocabulary
def build_vocab(sentences):
    word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    idx2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
    for sent in sentences:
        for word in sent.split():
            if word not in word2idx:
                idx = len(word2idx)
                word2idx[word] = idx
                idx2word[idx] = word
    return word2idx, idx2word

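# Build the vocabulary from all sentences (so the validation set contains no out-of-vocabulary words)
eng_sentences = [p[0] for p in pairs]
fra_sentences = [p[1] for p in pairs]
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
fra_word2idx, fra_idx2word = build_vocab(fra_sentences)

print(f"English vocabulary size: {len(eng_word2idx)}")
print(f"French vocabulary size: {len(fra_word2idx)}")

# Convert a sentence to an ID sequence (add <SOS> and <EOS>, then pad to the maximum length)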
def sentence_to_ids(sentence, word2idx, max_len=None):
    words = sentence.split()
    ids = [word2idx.get(w, word2idx["<UNK>"]) for w in words]
    ids = [word2idx["<SOS>"]] + ids + [word2idx["<EOS>"]]
    if max_len is not None:
        if len(ids) > max_len:
            ids = ids[:max_len-1] + [word2idx["<EOS>"]]
        else:
            ids += [word2idx["<PAD>"]] * (max_len - len(ids))
    return ids

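# Compute the maximum lengths (over all sentences, used for padding); +2 accounts for <SOS> and <EOS>
max_eng_len = max(len(s.split()) for s in eng_sentences) + 2
max_fra_len = max(len(s.split()) for s in fra_sentences) + 2
print(f"Max English sequence length: {max_eng_len}")
print(f"Max French sequence length: {max_fra_len}")

# Dataset class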
class TranslationDataset(Dataset):
    def __init__(self, pairs, eng_w2i, fra_w2i, max_eng_len, max_fra_len):
        self.pairs = pairs
        self.eng_w2i = eng_w2i
        self.fra_w2i = fra_w2i
        self.max_eng_len = max_eng_len
        self.max_fra_len = max_fra_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eng, fra = self.pairs[idx]
        eng_ids = sentence_to_ids(eng, self.eng_w2i, self.max_eng_len)
        fra_ids = sentence_to_ids(fra, self.fra_w2i, self.max_fra_len)
        return {
            "eng": torch.tensor(eng_ids, dtype=torch.long),
            "fra": torch.tensor(fra_ids, dtype=torch.long)
        }

# collate_fn: stack the samples of a batch
def collate_fn(batch):
    eng_batch = torch.stack([item["eng"] for item in batch])
    fra_batch = torch.stack([item["fra"] for item in batch])
    return {"src": eng_batch, "trg": fra_batch}

# Hyperparameters
batch_size = 128  # adjust to your GPU memory
train_dataset = TranslationDataset(train_pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)
val_dataset = TranslationDataset(val_pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

# ==================== 3. Model components ====================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, dropout=0.3):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, bidirectional=True,
                           dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.rnn(embedded)
        outputs = torch.tanh(self.fc(outputs))
        return outputs, hidden, cell

class AttentionParams(nn.Module):
    def __init__(self, hidden_size):
        super(AttentionParams, self).__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Parameter(torch.randn(hidden_size))

    def forward(self, hidden, encoder_outputs):
        batch_size, src_len = encoder_outputs.shape[:2]
        hidden_last = hidden[-1].unsqueeze(1).repeat(1, src_len, 1)
        combined = torch.cat((hidden_last, encoder_outputs), dim=2)
        energy = torch.tanh(self.attn(combined))
        scores = torch.einsum("bsh,h->bs", energy, self.v)
        attn_weights = torch.softmax(scores, dim=1)
        return attn_weights

class DecoderWithAttention(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, attention_module, dropout=0.3):
        super(DecoderWithAttention, self).__init__()
        self.attention = attention_module
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size * 2, hidden_size, num_layers,
                           batch_first=True,
                           dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell, encoder_outputs):
        embedded = self.embedding(x.unsqueeze(1))
        attn_weights = self.attention(hidden, encoder_outputs)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)
        rnn_input = torch.cat((embedded, context), dim=2)
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        predictions = self.fc(output.squeeze(1))
        return predictions, hidden, cell

class Seq2SeqAttention(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2SeqAttention, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size, trg_len = trg.shape
        trg_vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden, cell = self.encoder(src)

        # Merge the bidirectional states (sum the forward and backward states of each layer)
        hidden = hidden.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)
        cell = cell.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)

        input = trg[:, 0]  # <SOS>

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t, :] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1

        return outputs

# ==================== 4. Training setup ====================
hidden_size = 256
num_layers = 3
dropout = 0.3
learning_rate = 0.0002
epochs = 100

attention_module = AttentionParams(hidden_size)
encoder = Encoder(len(eng_word2idx), hidden_size, num_layers, dropout)
decoder = DecoderWithAttention(len(fra_word2idx), hidden_size, num_layers, attention_module, dropout)
model = Seq2SeqAttention(encoder, decoder, device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore <PAD>
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Learning-rate scheduler (optional)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)

# ==================== 5. Training loop (with validation and early stopping) ====================
best_val_loss = float('inf')
patience = 5
patience_counter = 0

print("Starting training...")
for epoch in range(1, epochs+1):
    # Training phase
    model.train()
    train_loss = 0.0
    for batch in train_loader:
        src = batch["src"].to(device)
        trg = batch["trg"].to(device)

        optimizer.zero_grad()
        output = model(src, trg, teacher_forcing_ratio=0.5)

        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()

        train_loss += loss.item()

    avg_train_loss = train_loss / len(train_loader)

    # Validation phase (ground-truth targets as decoder input, teacher_forcing_ratio=1)
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            src = batch["src"].to(device)
            trg = batch["trg"].to(device)
            output = model(src, trg, teacher_forcing_ratio=1)  # feed the ground-truth target
            output_dim = output.shape[-1]
            output = output[:, 1:, :].reshape(-1, output_dim)
            trg = trg[:, 1:].reshape(-1)
            loss = criterion(output, trg)
            val_loss += loss.item()
    avg_val_loss = val_loss / len(val_loader)

    # Update the learning rate
    scheduler.step(avg_val_loss)

    print(f"Epoch {epoch:3d}: train loss {avg_train_loss:.4f}, val loss {avg_val_loss:.4f}")

    # Save the best model
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        torch.save(model.state_dict(), "best_model.pth")
        print(f"  -> Saved best model (val loss {best_val_loss:.4f})")
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Validation loss has not improved for {patience} epochs; stopping early.")
            break

# Load the best model for inference (map_location keeps this working on CPU-only machines)
model.load_state_dict(torch.load("best_model.pth", map_location=device))
print(f"Training finished, best validation loss: {best_val_loss:.4f}")

# ==================== 6. Inference functions ====================
def greedy_translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20):
    model.eval()
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        trg_indices = [fra_w2i["<SOS>"]]
        for _ in range(max_len):
            trg_tensor = torch.tensor([trg_indices[-1]], dtype=torch.long).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell, encoder_outputs)
            pred = output.argmax(1).item()
            trg_indices.append(pred)
            if pred == fra_w2i["<EOS>"]:
                break
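    # Drop <SOS>; drop the final token only if it really is <EOS> (decoding may stop at max_len instead)
    out_ids = trg_indices[1:]
    if out_ids and out_ids[-1] == fra_w2i["<EOS>"]:
        out_ids = out_ids[:-1]
    translated = [fra_i2w[idx] for idx in out_ids]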
    return " ".join(translated)

def beam_search_translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20, beam_size=3):
    model.eval()
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        sequences = [[[fra_w2i["<SOS>"]], 0.0, hidden, cell]]

        for _ in range(max_len):
            all_candidates = []
            for seq, score, h, c in sequences:
                if seq[-1] == fra_w2i["<EOS>"]:
                    all_candidates.append((seq, score, h, c))
                    continue
                trg_tensor = torch.tensor([seq[-1]], dtype=torch.long).to(device)
                output, h_new, c_new = model.decoder(trg_tensor, h, c, encoder_outputs)
                log_probs = torch.log_softmax(output, dim=1)
                topk_probs, topk_ids = log_probs.topk(beam_size)
                for i in range(beam_size):
                    new_seq = seq + [topk_ids[0, i].item()]
                    new_score = score + topk_probs[0, i].item()
                    all_candidates.append((new_seq, new_score, h_new, c_new))
            sequences = sorted(all_candidates, key=lambda x: x[1], reverse=True)[:beam_size]
            if all(seq[-1] == fra_w2i["<EOS>"] for seq, _, _, _ in sequences):
                break

        best_seq = sequences[0][0]
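    # Drop <SOS>; drop the final token only if it really is <EOS> (a beam may end at max_len instead)
    out_ids = best_seq[1:]
    if out_ids and out_ids[-1] == fra_w2i["<EOS>"]:
        out_ids = out_ids[:-1]
    translated = [fra_i2w[idx] for idx in out_ids]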
    return " ".join(translated)

# ==================== 7. 测试示例 ====================
test_sentences = ["hello", "goodbye", "thank you", "how are you", "i love you"]
print("\n" + "="*50)
print("贪心解码结果：")
for sent in test_sentences:
    translation = greedy_translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word)
    print(f"English: {sent:20} -> French: {translation}")

print("\n" + "="*50)
print("束搜索解码结果（beam_size=3）：")
for sent in test_sentences:
    translation = beam_search_translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word, beam_size=3)
    print(f"English: {sent:20} -> French: {translation}")

Output:

(base) PS E:\Datawhale 2026\base-llm202602> & D:/Users/app/miniconda3/envs/base-llm/python.exe "e:/Datawhale 2026/base-llm202602/04_5_final_attention_translation.py"

正在加载数据...

使用句子对数: 20000

清洗后有效句子对数: 20000

训练集大小: 18000

验证集大小: 2000

英文词表大小: 4705

法文词表大小: 8974

最大英文序列长度: 7

最大法文序列长度: 13

使用设备: cuda

开始训练...

Epoch 1: train loss 6.1642, val loss 5.3252

-> 保存最佳模型 (val loss 5.3252)

Epoch 2: train loss 5.1295, val loss 5.1363

-> 保存最佳模型 (val loss 5.1363)

Epoch 3: train loss 4.8706, val loss 4.8918

-> 保存最佳模型 (val loss 4.8918)

Epoch 4: train loss 4.6043, val loss 4.6755

-> 保存最佳模型 (val loss 4.6755)

Epoch 5: train loss 4.3785, val loss 4.5047

-> 保存最佳模型 (val loss 4.5047)

Epoch 6: train loss 4.1942, val loss 4.3521

-> 保存最佳模型 (val loss 4.3521)

Epoch 7: train loss 4.0278, val loss 4.2139

-> 保存最佳模型 (val loss 4.2139)

Epoch 8: train loss 3.8684, val loss 4.0648

-> 保存最佳模型 (val loss 4.0648)

Epoch 9: train loss 3.7170, val loss 3.9445

-> 保存最佳模型 (val loss 3.9445)

Epoch 10: train loss 3.5827, val loss 3.8562

-> 保存最佳模型 (val loss 3.8562)

Epoch 11: train loss 3.4537, val loss 3.7461

-> 保存最佳模型 (val loss 3.7461)

Epoch 12: train loss 3.3334, val loss 3.6792

-> 保存最佳模型 (val loss 3.6792)

Epoch 13: train loss 3.2260, val loss 3.6001

-> 保存最佳模型 (val loss 3.6001)

Epoch 14: train loss 3.1182, val loss 3.5398

-> 保存最佳模型 (val loss 3.5398)

Epoch 15: train loss 3.0274, val loss 3.4788

-> 保存最佳模型 (val loss 3.4788)

Epoch 16: train loss 2.9290, val loss 3.4286

-> 保存最佳模型 (val loss 3.4286)

Epoch 17: train loss 2.8285, val loss 3.3697

-> 保存最佳模型 (val loss 3.3697)

Epoch 18: train loss 2.7453, val loss 3.3276

-> 保存最佳模型 (val loss 3.3276)

Epoch 19: train loss 2.6598, val loss 3.2985

-> 保存最佳模型 (val loss 3.2985)

Epoch 20: train loss 2.5787, val loss 3.2512

-> 保存最佳模型 (val loss 3.2512)

Epoch 21: train loss 2.4966, val loss 3.2068

-> 保存最佳模型 (val loss 3.2068)

Epoch 22: train loss 2.4226, val loss 3.1811

-> 保存最佳模型 (val loss 3.1811)

Epoch 23: train loss 2.3452, val loss 3.1391

-> 保存最佳模型 (val loss 3.1391)

Epoch 24: train loss 2.2718, val loss 3.1236

-> 保存最佳模型 (val loss 3.1236)

Epoch 25: train loss 2.1969, val loss 3.0862

-> 保存最佳模型 (val loss 3.0862)

Epoch 26: train loss 2.1282, val loss 3.0626

-> 保存最佳模型 (val loss 3.0626)

Epoch 27: train loss 2.0682, val loss 3.0404

-> 保存最佳模型 (val loss 3.0404)

Epoch 28: train loss 2.0040, val loss 3.0198

-> 保存最佳模型 (val loss 3.0198)

Epoch 29: train loss 1.9342, val loss 2.9815

-> 保存最佳模型 (val loss 2.9815)

Epoch 30: train loss 1.8676, val loss 2.9700

-> 保存最佳模型 (val loss 2.9700)

Epoch 31: train loss 1.8101, val loss 2.9407

-> 保存最佳模型 (val loss 2.9407)

Epoch 32: train loss 1.7690, val loss 2.9417

Epoch 33: train loss 1.6965, val loss 2.9134

-> 保存最佳模型 (val loss 2.9134)

Epoch 34: train loss 1.6549, val loss 2.8914

-> 保存最佳模型 (val loss 2.8914)

Epoch 35: train loss 1.5989, val loss 2.8864

-> 保存最佳模型 (val loss 2.8864)

Epoch 36: train loss 1.5485, val loss 2.8891

Epoch 37: train loss 1.5021, val loss 2.8673

-> 保存最佳模型 (val loss 2.8673)

Epoch 38: train loss 1.4520, val loss 2.8544

-> 保存最佳模型 (val loss 2.8544)

Epoch 39: train loss 1.4109, val loss 2.8433

-> 保存最佳模型 (val loss 2.8433)

Epoch 40: train loss 1.3614, val loss 2.8323

-> 保存最佳模型 (val loss 2.8323)

Epoch 41: train loss 1.3217, val loss 2.8206

-> 保存最佳模型 (val loss 2.8206)

Epoch 42: train loss 1.2795, val loss 2.8172

-> 保存最佳模型 (val loss 2.8172)

Epoch 43: train loss 1.2408, val loss 2.8145

-> 保存最佳模型 (val loss 2.8145)

Epoch 44: train loss 1.2071, val loss 2.8082

-> 保存最佳模型 (val loss 2.8082)

Epoch 45: train loss 1.1628, val loss 2.8060

-> 保存最佳模型 (val loss 2.8060)

Epoch 46: train loss 1.1196, val loss 2.8013

-> 保存最佳模型 (val loss 2.8013)

Epoch 47: train loss 1.0818, val loss 2.7826

-> 保存最佳模型 (val loss 2.7826)

Epoch 48: train loss 1.0490, val loss 2.7831

Epoch 49: train loss 1.0176, val loss 2.7735

-> 保存最佳模型 (val loss 2.7735)

Epoch 50: train loss 0.9910, val loss 2.7766

Epoch 51: train loss 0.9626, val loss 2.7810

Epoch 52: train loss 0.9269, val loss 2.7739

Epoch 53: train loss 0.8764, val loss 2.7691

-> 保存最佳模型 (val loss 2.7691)

Epoch 54: train loss 0.8431, val loss 2.7712

Epoch 55: train loss 0.8269, val loss 2.7684

-> 保存最佳模型 (val loss 2.7684)

Epoch 56: train loss 0.8258, val loss 2.7806

Epoch 57: train loss 0.8093, val loss 2.7736

Epoch 58: train loss 0.7899, val loss 2.7817

Epoch 59: train loss 0.7639, val loss 2.7747

Epoch 60: train loss 0.7481, val loss 2.7698

验证损失连续 5 轮未下降，提前停止训练。

训练完成，最佳验证损失: 2.7684

==================================================

贪心解码结果：

English: hello -> French: vous te !

English: goodbye -> French: y un ?

English: thank you -> French: est-ce à vous ?

English: how are you -> French: comment allez-vous le ?

English: i love you -> French: je mecs, vous vous

==================================================

束搜索解码结果（beam_size=3）：

English: hello -> French: à te !

English: goodbye -> French: que que

English: thank you -> French: est-ce à vous ?

English: how are you -> French: comment allez-vous le ?

English: i love you -> French: je mecs, vous vous

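Interpreting the output:

Training made clear progress: validation loss fell from 5.33 in the first epoch to a best of 2.77 (2.7684 at epoch 55), so the model is learning. The translations are still poor, though, most likely for the following reasons:

Too little data: 20,000 sentence pairs is not enough to train a deep neural translation model. Neural machine translation usually needs hundreds of thousands to millions of parallel sentence pairs to reach good quality. Try using all of the data (fra.txt usually contains more than 200,000 pairs).
Model capacity may be too large: hidden_size=256 and num_layers=3 can still overfit 20,000 pairs. By the last epoch the training loss is 0.75 against a validation loss of 2.77. The two numbers are not strictly comparable, because training uses teacher_forcing_ratio=0.5 and validation uses 1, but the training loss keeps falling while the validation loss has plateaued, which suggests the model is memorizing the training set and generalizing poorly.
Punctuation is kept in the data: punctuation in the French sentences (such as ! and ?) makes learning harder, because the model has to learn words and punctuation at the same time. One option is to strip punctuation first so the model can focus on word correspondences, then reintroduce it once basic translation quality improves.
Learning rate schedule: the script uses ReduceLROnPlateau, but validation loss falls slowly, so longer training or a different schedule may be needed.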

🚀 Next improvements

1. Use all of the data (if the file is complete)

Change num_pairs so that all of the data is used (drop the slice), for example:

raw_pairs = [line.strip().split("\t") for line in lines]  # use every line

2. Reduce model complexity

Reduce hidden_size to 128 and num_layers to 2, and raise dropout to 0.5.

3. Simplify the data (strip punctuation)

Modify the clean_text function to remove all punctuation:

def clean_text(text):
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # strip punctuation
    text = text.replace('\u00A0', ' ')
    return text
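
One caveat worth checking before retraining: the pattern above deletes every non-word character, including the apostrophes and hyphens inside French contractions, so neighbouring words get fused into one token. The sketch below compares it with a hypothetical variant, clean_text_split (an assumption for illustration, not part of the original script), that replaces punctuation with a space instead.

```python
import re

def clean_text(text):
    """The cleaning function from the notes: punctuation is deleted outright."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    text = text.replace("\u00A0", " ")
    return text

def clean_text_split(text):
    """Hypothetical variant: punctuation becomes a space, so contractions split."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())  # collapse runs of whitespace

sample = "N'êtes-vous pas ?"
fused = clean_text(sample).split()
split = clean_text_split(sample).split()
print(fused)  # ['nêtesvous', 'pas']
print(split)  # ['n', 'êtes', 'vous', 'pas']
```

Whether splitting helps is an empirical question: it should shrink the French vocabulary, at the cost of slightly longer sequences.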

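4. Train for more epochs and monitor validation loss

Set epochs to 200 and keep early stopping, so the model has enough time to converge.

5. Try a larger batch (if GPU memory allows)

batch_size=128 is already reasonable, but 256 is worth trying if GPU memory allows.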

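"""
final_attention_translation_full.py
Full English-to-French translation model with attention (all data, punctuation stripped, reduced complexity)
- Uses all of fra.txt
- Text cleaning: lowercase, strip punctuation, normalize special spaces
- Model parameters: hidden_size=128, num_layers=2, dropout=0.5
- Trains for up to 200 epochs with early stopping, saving the best model
- Includes greedy decoding and beam search inference
"""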

import random
import re
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.model_selection import train_test_split

# ==================== 1. 设置随机种子 ====================
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)

# ==================== 2. 数据准备 ====================
print("正在加载数据...")
with open("fra.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

# 使用全部数据（去掉切片）
raw_pairs = [line.strip().split("\t") for line in lines]
print(f"总句子对数: {len(raw_pairs)}")

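# Cleaning: lowercase, strip punctuation, normalize special spaces
def clean_text(text):
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # strip punctuation (this also removes apostrophes and hyphens)
    text = text.replace('\u00A0', ' ')   # replace non-breaking spaces
    return text

# Use the first two tab-separated fields (English, French); skip malformed lines with fewer than two
pairs = [(clean_text(p[0]), clean_text(p[1])) for p in raw_pairs if len(p) >= 2]
print(f"清洗后有效句子对数: {len(pairs)}")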

# 划分训练集和验证集
train_pairs, val_pairs = train_test_split(pairs, test_size=0.1, random_state=42)
print(f"训练集大小: {len(train_pairs)}")
print(f"验证集大小: {len(val_pairs)}")

# 构建词汇表
def build_vocab(sentences):
    word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    idx2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
    for sent in sentences:
        for word in sent.split():
            if word not in word2idx:
                idx = len(word2idx)
                word2idx[word] = idx
                idx2word[idx] = word
    return word2idx, idx2word

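# Build the vocabulary from all sentences, so the validation set has no out-of-vocabulary words
# (note: this lets validation-set words leak into the vocabulary)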
eng_sentences = [p[0] for p in pairs]
fra_sentences = [p[1] for p in pairs]
eng_word2idx, eng_idx2word = build_vocab(eng_sentences)
fra_word2idx, fra_idx2word = build_vocab(fra_sentences)

print(f"英文词表大小: {len(eng_word2idx)}")
print(f"法文词表大小: {len(fra_word2idx)}")

# 句子转 ID 序列（添加 SOS 和 EOS，并填充到最大长度）
def sentence_to_ids(sentence, word2idx, max_len=None):
    words = sentence.split()
    ids = [word2idx.get(w, word2idx["<UNK>"]) for w in words]
    ids = [word2idx["<SOS>"]] + ids + [word2idx["<EOS>"]]
    if max_len is not None:
        if len(ids) > max_len:
            ids = ids[:max_len-1] + [word2idx["<EOS>"]]
        else:
            ids += [word2idx["<PAD>"]] * (max_len - len(ids))
    return ids

# 计算最大长度（基于所有句子，用于填充）
max_eng_len = max(len(s.split()) for s in eng_sentences) + 2
max_fra_len = max(len(s.split()) for s in fra_sentences) + 2
print(f"最大英文序列长度: {max_eng_len}")
print(f"最大法文序列长度: {max_fra_len}")

# 数据集类
class TranslationDataset(Dataset):
    def __init__(self, pairs, eng_w2i, fra_w2i, max_eng_len, max_fra_len):
        self.pairs = pairs
        self.eng_w2i = eng_w2i
        self.fra_w2i = fra_w2i
        self.max_eng_len = max_eng_len
        self.max_fra_len = max_fra_len

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        eng, fra = self.pairs[idx]
        eng_ids = sentence_to_ids(eng, self.eng_w2i, self.max_eng_len)
        fra_ids = sentence_to_ids(fra, self.fra_w2i, self.max_fra_len)
        return {
            "eng": torch.tensor(eng_ids, dtype=torch.long),
            "fra": torch.tensor(fra_ids, dtype=torch.long)
        }

# collate_fn：将批次中的样本堆叠
def collate_fn(batch):
    eng_batch = torch.stack([item["eng"] for item in batch])
    fra_batch = torch.stack([item["fra"] for item in batch])
    return {"src": eng_batch, "trg": fra_batch}

# 超参数
batch_size = 128  # 可根据显存调整
train_dataset = TranslationDataset(train_pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)
val_dataset = TranslationDataset(val_pairs, eng_word2idx, fra_word2idx, max_eng_len, max_fra_len)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

# ==================== 3. 定义模型组件 ====================
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"使用设备: {device}")

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, dropout=0.5):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, num_layers,
                           batch_first=True, bidirectional=True,
                           dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, x):
        embedded = self.embedding(x)
        outputs, (hidden, cell) = self.rnn(embedded)
        outputs = torch.tanh(self.fc(outputs))
        return outputs, hidden, cell

class AttentionParams(nn.Module):
    def __init__(self, hidden_size):
        super(AttentionParams, self).__init__()
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        self.v = nn.Parameter(torch.randn(hidden_size))

    def forward(self, hidden, encoder_outputs):
        batch_size, src_len = encoder_outputs.shape[:2]
        hidden_last = hidden[-1].unsqueeze(1).repeat(1, src_len, 1)
        combined = torch.cat((hidden_last, encoder_outputs), dim=2)
        energy = torch.tanh(self.attn(combined))
        scores = torch.einsum("bsh,h->bs", energy, self.v)
        attn_weights = torch.softmax(scores, dim=1)
        return attn_weights

class DecoderWithAttention(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers, attention_module, dropout=0.5):
        super(DecoderWithAttention, self).__init__()
        self.attention = attention_module
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size * 2, hidden_size, num_layers,
                           batch_first=True,
                           dropout=dropout if num_layers > 1 else 0)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden, cell, encoder_outputs):
        embedded = self.embedding(x.unsqueeze(1))
        attn_weights = self.attention(hidden, encoder_outputs)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs)
        rnn_input = torch.cat((embedded, context), dim=2)
        output, (hidden, cell) = self.rnn(rnn_input, (hidden, cell))
        predictions = self.fc(output.squeeze(1))
        return predictions, hidden, cell

class Seq2SeqAttention(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2SeqAttention, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size, trg_len = trg.shape
        trg_vocab_size = self.decoder.fc.out_features
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden, cell = self.encoder(src)

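        # Merge the two directions: h_n/c_n have shape (num_layers * 2, batch, hidden), layer-major,
        # so view as (num_layers, 2, batch, hidden) and sum over the direction axis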
        hidden = hidden.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)
        cell = cell.view(self.encoder.rnn.num_layers, 2, batch_size, -1).sum(dim=1)

        input = trg[:, 0]  # <SOS>

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell, encoder_outputs)
            outputs[:, t, :] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[:, t] if teacher_force else top1

        return outputs

# ==================== 4. 训练准备 ====================
hidden_size = 128
num_layers = 2
dropout = 0.5
learning_rate = 0.0002
epochs = 200

attention_module = AttentionParams(hidden_size)
encoder = Encoder(len(eng_word2idx), hidden_size, num_layers, dropout)
decoder = DecoderWithAttention(len(fra_word2idx), hidden_size, num_layers, attention_module, dropout)
model = Seq2SeqAttention(encoder, decoder, device).to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)  # 忽略 <PAD>
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# 学习率调度器（可选）
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=2)

# ==================== 5. 训练循环（带验证和早停）====================
best_val_loss = float('inf')
patience = 5
patience_counter = 0

print("开始训练...")
for epoch in range(1, epochs+1):
    # 训练阶段
    model.train()
    train_loss = 0.0
    for batch in train_loader:
        src = batch["src"].to(device)
        trg = batch["trg"].to(device)

        optimizer.zero_grad()
        output = model(src, trg, teacher_forcing_ratio=0.5)

        output_dim = output.shape[-1]
        output = output[:, 1:, :].reshape(-1, output_dim)
        trg = trg[:, 1:].reshape(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
        optimizer.step()

        train_loss += loss.item()

    avg_train_loss = train_loss / len(train_loader)

    # 验证阶段（使用真实目标作为输入，teacher_forcing_ratio=1）
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            src = batch["src"].to(device)
            trg = batch["trg"].to(device)
            output = model(src, trg, teacher_forcing_ratio=1)  # 使用真实目标
            output_dim = output.shape[-1]
            output = output[:, 1:, :].reshape(-1, output_dim)
            trg = trg[:, 1:].reshape(-1)
            loss = criterion(output, trg)
            val_loss += loss.item()
    avg_val_loss = val_loss / len(val_loader)

    # 更新学习率
    scheduler.step(avg_val_loss)

    print(f"Epoch {epoch:3d}: train loss {avg_train_loss:.4f}, val loss {avg_val_loss:.4f}")

    # 保存最佳模型
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        torch.save(model.state_dict(), "best_model.pth")
        print(f"  -> 保存最佳模型 (val loss {best_val_loss:.4f})")
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"验证损失连续 {patience} 轮未下降，提前停止训练。")
            break

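# Load the best checkpoint for inference (map_location keeps this working on a CPU-only machine)
model.load_state_dict(torch.load("best_model.pth", map_location=device))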
print(f"训练完成，最佳验证损失: {best_val_loss:.4f}")

# ==================== 6. 推理函数 ====================
def greedy_translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20):
    model.eval()
    # 对输入句子进行同样的清洗（去除标点）
    sentence = clean_text(sentence)
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        trg_indices = [fra_w2i["<SOS>"]]
        for _ in range(max_len):
            trg_tensor = torch.tensor([trg_indices[-1]], dtype=torch.long).to(device)
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell, encoder_outputs)
            pred = output.argmax(1).item()
            trg_indices.append(pred)
            if pred == fra_w2i["<EOS>"]:
                break
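    # Drop <SOS>; drop the final token only if it really is <EOS> (decoding may stop at max_len instead)
    out_ids = trg_indices[1:]
    if out_ids and out_ids[-1] == fra_w2i["<EOS>"]:
        out_ids = out_ids[:-1]
    translated = [fra_i2w[idx] for idx in out_ids]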
    return " ".join(translated)

def beam_search_translate(model, sentence, eng_w2i, fra_w2i, fra_i2w, max_len=20, beam_size=3):
    model.eval()
    sentence = clean_text(sentence)
    tokens = sentence.split()
    ids = [eng_w2i.get(w, eng_w2i["<UNK>"]) for w in tokens]
    ids = [eng_w2i["<SOS>"]] + ids + [eng_w2i["<EOS>"]]
    src_tensor = torch.tensor([ids], dtype=torch.long).to(device)

    with torch.no_grad():
        encoder_outputs, hidden, cell = model.encoder(src_tensor)
        hidden = hidden.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)
        cell = cell.view(model.encoder.rnn.num_layers, 2, 1, -1).sum(dim=1)

        sequences = [[[fra_w2i["<SOS>"]], 0.0, hidden, cell]]

        for _ in range(max_len):
            all_candidates = []
            for seq, score, h, c in sequences:
                if seq[-1] == fra_w2i["<EOS>"]:
                    all_candidates.append((seq, score, h, c))
                    continue
                trg_tensor = torch.tensor([seq[-1]], dtype=torch.long).to(device)
                output, h_new, c_new = model.decoder(trg_tensor, h, c, encoder_outputs)
                log_probs = torch.log_softmax(output, dim=1)
                topk_probs, topk_ids = log_probs.topk(beam_size)
                for i in range(beam_size):
                    new_seq = seq + [topk_ids[0, i].item()]
                    new_score = score + topk_probs[0, i].item()
                    all_candidates.append((new_seq, new_score, h_new, c_new))
            sequences = sorted(all_candidates, key=lambda x: x[1], reverse=True)[:beam_size]
            if all(seq[-1] == fra_w2i["<EOS>"] for seq, _, _, _ in sequences):
                break

        best_seq = sequences[0][0]
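    # Drop <SOS>; drop the final token only if it really is <EOS> (a beam may end at max_len instead)
    out_ids = best_seq[1:]
    if out_ids and out_ids[-1] == fra_w2i["<EOS>"]:
        out_ids = out_ids[:-1]
    translated = [fra_i2w[idx] for idx in out_ids]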
    return " ".join(translated)

# ==================== 7. 测试示例 ====================
test_sentences = ["hello", "goodbye", "thank you", "how are you", "i love you"]
print("\n" + "="*50)
print("贪心解码结果：")
for sent in test_sentences:
    translation = greedy_translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word)
    print(f"English: {sent:20} -> French: {translation}")

print("\n" + "="*50)
print("束搜索解码结果（beam_size=3）：")
for sent in test_sentences:
    translation = beam_search_translate(model, sent, eng_word2idx, fra_word2idx, fra_idx2word, beam_size=3)
    print(f"English: {sent:20} -> French: {translation}")

This script started running at 10:00 and finished at 14:33 the next day (about 28.5 hours).

(base) PS E:\Datawhale 2026\base-llm202602> & D:/Users/app/miniconda3/envs/base-llm/python.exe "e:/Datawhale 2026/base-llm202602/04_6_final_attention_translation_full.py"

正在加载数据...

总句子对数: 240521

清洗后有效句子对数: 240521

训练集大小: 216468

验证集大小: 24053

英文词表大小: 17299

法文词表大小: 33756

最大英文序列长度: 57

最大法文序列长度: 58

使用设备: cuda

开始训练...

Epoch 1: train loss 5.8642, val loss 5.2208

-> 保存最佳模型 (val loss 5.2208)

Epoch 2: train loss 4.9911, val loss 4.6263

-> 保存最佳模型 (val loss 4.6263)

Epoch 3: train loss 4.5548, val loss 4.2255

-> 保存最佳模型 (val loss 4.2255)

Epoch 4: train loss 4.2360, val loss 3.9245

-> 保存最佳模型 (val loss 3.9245)

Epoch 5: train loss 3.9903, val loss 3.7016

-> 保存最佳模型 (val loss 3.7016)

Epoch 6: train loss 3.7866, val loss 3.5170

-> 保存最佳模型 (val loss 3.5170)

Epoch 7: train loss 3.6141, val loss 3.3546

-> 保存最佳模型 (val loss 3.3546)

Epoch 8: train loss 3.4624, val loss 3.2160

-> 保存最佳模型 (val loss 3.2160)

Epoch 9: train loss 3.3280, val loss 3.1001

-> 保存最佳模型 (val loss 3.1001)

Epoch 10: train loss 3.2132, val loss 2.9958

-> 保存最佳模型 (val loss 2.9958)

Epoch 11: train loss 3.1029, val loss 2.8932

-> 保存最佳模型 (val loss 2.8932)

Epoch 12: train loss 3.0070, val loss 2.8188

-> 保存最佳模型 (val loss 2.8188)

Epoch 13: train loss 2.9129, val loss 2.7280

-> 保存最佳模型 (val loss 2.7280)

Epoch 14: train loss 2.8317, val loss 2.6633

-> 保存最佳模型 (val loss 2.6633)

Epoch 15: train loss 2.7587, val loss 2.6048

-> 保存最佳模型 (val loss 2.6048)

Epoch 16: train loss 2.6878, val loss 2.5465

-> 保存最佳模型 (val loss 2.5465)

Epoch 17: train loss 2.6274, val loss 2.5021

-> 保存最佳模型 (val loss 2.5021)

Epoch 18: train loss 2.5570, val loss 2.4382

-> 保存最佳模型 (val loss 2.4382)

Epoch 19: train loss 2.5051, val loss 2.3987

-> 保存最佳模型 (val loss 2.3987)

Epoch 20: train loss 2.4447, val loss 2.3601

-> 保存最佳模型 (val loss 2.3601)

Epoch 21: train loss 2.3981, val loss 2.3314

-> 保存最佳模型 (val loss 2.3314)

Epoch 22: train loss 2.3475, val loss 2.2793

-> 保存最佳模型 (val loss 2.2793)

Epoch 23: train loss 2.2994, val loss 2.2537

-> 保存最佳模型 (val loss 2.2537)

Epoch 24: train loss 2.2601, val loss 2.2260

-> 保存最佳模型 (val loss 2.2260)

Epoch 25: train loss 2.2196, val loss 2.1973

-> 保存最佳模型 (val loss 2.1973)

Epoch 26: train loss 2.1794, val loss 2.1648

-> 保存最佳模型 (val loss 2.1648)

Epoch 27: train loss 2.1439, val loss 2.1509

-> 保存最佳模型 (val loss 2.1509)

Epoch 28: train loss 2.1101, val loss 2.1328

-> 保存最佳模型 (val loss 2.1328)

Epoch 29: train loss 2.0833, val loss 2.1071

-> 保存最佳模型 (val loss 2.1071)

Epoch 30: train loss 2.0484, val loss 2.0897

-> 保存最佳模型 (val loss 2.0897)

Epoch 31: train loss 2.0172, val loss 2.0671

-> 保存最佳模型 (val loss 2.0671)

Epoch 32: train loss 1.9899, val loss 2.0531

-> 保存最佳模型 (val loss 2.0531)

Epoch 33: train loss 1.9612, val loss 2.0340

-> 保存最佳模型 (val loss 2.0340)

Epoch 34: train loss 1.9380, val loss 2.0161

-> 保存最佳模型 (val loss 2.0161)

Epoch 35: train loss 1.9135, val loss 1.9973

-> 保存最佳模型 (val loss 1.9973)

Epoch 36: train loss 1.8884, val loss 2.0049

Epoch 37: train loss 1.8634, val loss 1.9807

-> 保存最佳模型 (val loss 1.9807)

Epoch 38: train loss 1.8383, val loss 1.9635

-> 保存最佳模型 (val loss 1.9635)

Epoch 39: train loss 1.8196, val loss 1.9535

-> 保存最佳模型 (val loss 1.9535)

Epoch 40: train loss 1.7994, val loss 1.9463

-> 保存最佳模型 (val loss 1.9463)

Epoch 41: train loss 1.7807, val loss 1.9364

-> 保存最佳模型 (val loss 1.9364)

Epoch 42: train loss 1.7637, val loss 1.9275

-> 保存最佳模型 (val loss 1.9275)

Epoch 43: train loss 1.7396, val loss 1.9174

-> 保存最佳模型 (val loss 1.9174)

Epoch 44: train loss 1.7098, val loss 1.9016

-> 保存最佳模型 (val loss 1.9016)

Epoch 45: train loss 1.7030, val loss 1.8975

-> 保存最佳模型 (val loss 1.8975)

Epoch 46: train loss 1.6872, val loss 1.8996

Epoch 47: train loss 1.6673, val loss 1.8779

-> 保存最佳模型 (val loss 1.8779)

Epoch 48: train loss 1.6564, val loss 1.8797

Epoch 49: train loss 1.6392, val loss 1.8740

-> 保存最佳模型 (val loss 1.8740)

Epoch 50: train loss 1.6264, val loss 1.8675

-> 保存最佳模型 (val loss 1.8675)

Epoch 51: train loss 1.5999, val loss 1.8537

-> 保存最佳模型 (val loss 1.8537)

Epoch 52: train loss 1.5912, val loss 1.8507

-> 保存最佳模型 (val loss 1.8507)

Epoch 53: train loss 1.5745, val loss 1.8427

-> 保存最佳模型 (val loss 1.8427)

Epoch 54: train loss 1.5606, val loss 1.8502

Epoch 55: train loss 1.5489, val loss 1.8410

-> 保存最佳模型 (val loss 1.8410)

Epoch 56: train loss 1.5336, val loss 1.8260

-> 保存最佳模型 (val loss 1.8260)

Epoch 57: train loss 1.5169, val loss 1.8284

Epoch 58: train loss 1.5176, val loss 1.8216

-> 保存最佳模型 (val loss 1.8216)

Epoch 59: train loss 1.4991, val loss 1.8287

Epoch 60: train loss 1.4904, val loss 1.8110

-> 保存最佳模型 (val loss 1.8110)

Epoch 61: train loss 1.4775, val loss 1.8124

Epoch 62: train loss 1.4679, val loss 1.8150

Epoch 63: train loss 1.4583, val loss 1.8007

-> 保存最佳模型 (val loss 1.8007)

Epoch 64: train loss 1.4392, val loss 1.8004

-> 保存最佳模型 (val loss 1.8004)

Epoch 65: train loss 1.4412, val loss 1.7997

-> 保存最佳模型 (val loss 1.7997)

Epoch 66: train loss 1.4219, val loss 1.7898

-> 保存最佳模型 (val loss 1.7898)

Epoch 67: train loss 1.4160, val loss 1.7942

Epoch 68: train loss 1.4057, val loss 1.7866

-> 保存最佳模型 (val loss 1.7866)

Epoch 69: train loss 1.3968, val loss 1.7915

Epoch 70: train loss 1.3862, val loss 1.7852

-> 保存最佳模型 (val loss 1.7852)

Epoch 71: train loss 1.3818, val loss 1.7857

Epoch 72: train loss 1.3670, val loss 1.7871

Epoch 73: train loss 1.3633, val loss 1.7900

Epoch 74: train loss 1.3222, val loss 1.7676

-> 保存最佳模型 (val loss 1.7676)

Epoch 75: train loss 1.3127, val loss 1.7763

Epoch 76: train loss 1.3102, val loss 1.7691

Epoch 77: train loss 1.3004, val loss 1.7698

Epoch 78: train loss 1.2843, val loss 1.7700

Epoch 79: train loss 1.2771, val loss 1.7688

验证损失连续 5 轮未下降，提前停止训练。

训练完成，最佳验证损失: 1.7676

==================================================

贪心解码结果：

English: hello -> French: sauf

English: goodbye -> French: hurler

English: thank you -> French: pour lui être

English: how are you -> French: quelle vous nétais pas

English: i love you -> French: mercure

==================================================

束搜索解码结果（beam_size=3）：

English: hello -> French: personnelle

English: goodbye -> French: hurler

English: thank you -> French: juste

English: how are you -> French: que nêtesvous pas

English: i love you -> French: possible

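📊 Analysis of the training results

1. Training went normally

Validation loss fell steadily from 5.22 to 1.7676, so the model is learning.
Early stopping triggered at epoch 79, after 5 consecutive epochs without a new best validation loss (the best was at epoch 74); validation loss had plateaued.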
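
To put these loss values in perspective, a cross-entropy loss in nats (which is what nn.CrossEntropyLoss reports) can be converted to perplexity with exp(loss). A small sketch using the numbers from the log above; the helper name perplexity is only for illustration:

```python
import math

def perplexity(cross_entropy_nats):
    """Perplexity = exp(mean cross-entropy in nats)."""
    return math.exp(cross_entropy_nats)

first_epoch = perplexity(5.2208)   # validation loss after epoch 1
best_epoch = perplexity(1.7676)    # best validation loss (epoch 74)
print(f"epoch 1: {first_epoch:.1f}, best: {best_epoch:.1f}")
```

A perplexity of roughly 6 means that, given the correct previous words (validation runs with teacher_forcing_ratio=1), the model is about as uncertain as a uniform choice among six words. It says nothing directly about free-running decoding.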

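2. Why are the translations so poor?

Greedy decoding:
hello    -> sauf (except)
goodbye  -> hurler (to howl)
thank you -> pour lui être (for him to be)
how are you -> quelle vous nétais pas (what you are not)
i love you -> mercure (mercury)

Beam search:
hello    -> personnelle (personal)
goodbye  -> hurler (to howl)
thank you -> juste (just)
how are you -> que nêtesvous pas (what you are not)
i love you -> possible (possible)

Completely wrong! Although the validation loss looks low, the model has not learned to translate these sentences. Validation loss is measured with teacher_forcing_ratio=1, so it only measures next-word prediction given the correct previous words; at inference the decoder consumes its own predictions, so a low validation loss does not guarantee good free-running translations. The model appears to have memorized some words from the training set rather than learned the mapping.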

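🔍 Why does this happen?

Core cause: preprocessing and inference are inconsistent

Possible causes:

Vocabulary and punctuation: punctuation was stripped from the French sentences during training, so the French vocabulary contains no tokens such as ! or ?. The test sentences contain no punctuation either, and both inference functions apply the same clean_text to the input, so punctuation itself is not the mismatch.
Tokenization side effect: re.sub(r"[^\w\s]", "", text) deletes punctuation rather than splitting on it, so Va ! becomes the single token va, which is fine. However, it also deletes apostrophes and hyphens, which fuses French contractions into single tokens: the outputs nétais and nêtesvous above come from n'étais and n'êtes-vous. This inflates the French vocabulary (33,756 entries) and makes many word forms rare.
Padding mismatch (a likely candidate, not verified here): during training every source sentence is padded to max_eng_len (57) and the bidirectional LSTM runs over the <PAD> positions, so the final hidden states are read after many padding steps, while the inference functions feed the sentence unpadded. Attention is also computed over the padded positions without a mask.
Weak mapping for short sentences: the model may have picked up only high-frequency words and common phrases, without learning the correct mapping for simple sentences like the test inputs. For example, hello may correspond to several translations in the training set (bonjour, salut), and the model may have latched onto other words that co-occur with hello, which makes the output look random.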
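
One possible contributor that the scripts above make easy to check: the Dataset pads every source sentence to max_eng_len during training, while greedy_translate and beam_search_translate feed the sentence unpadded. The sketch below uses a tiny stand-in embedding and bidirectional LSTM (arbitrary sizes, not the trained model) to show that the final hidden state changes when padding is appended, and that torch.nn.utils.rnn.pack_padded_sequence removes the difference. It illustrates the mechanism only; it does not prove this is what broke the translations.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
PAD = 0

# Tiny stand-ins for the encoder's embedding and bidirectional LSTM (sizes are arbitrary)
embedding = nn.Embedding(10, 8, padding_idx=PAD)
lstm = nn.LSTM(8, 8, batch_first=True, bidirectional=True)

unpadded = torch.tensor([[1, 4, 5, 2]])                    # how inference feeds a sentence
padded = torch.tensor([[1, 4, 5, 2, PAD, PAD, PAD, PAD]])  # how training feeds the same sentence
lengths = torch.tensor([4])                                # true length, kept on the CPU

with torch.no_grad():
    _, (h_unpadded, _) = lstm(embedding(unpadded))
    _, (h_padded, _) = lstm(embedding(padded))
    # Packing tells the LSTM to stop at each sequence's true length
    packed = pack_padded_sequence(embedding(padded), lengths,
                                  batch_first=True, enforce_sorted=False)
    _, (h_packed, _) = lstm(packed)

gap_padded = (h_unpadded - h_padded).abs().max().item()
gap_packed = (h_unpadded - h_packed).abs().max().item()
print(f"padded vs unpadded final state: {gap_padded:.4f}")
print(f"packed vs unpadded final state: {gap_packed:.6f}")
```

If this turns out to matter, one option would be to return sequence lengths from collate_fn, pack the encoder input, and mask the padded positions in the attention scores before the softmax.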

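Section 3: The Transformer in Depth

I. Why do we need the Transformer?

The RNN and LSTM models covered earlier share one problem: they must read one word at a time, in order, like reading a book character by character without ever seeing the whole page at once. This is slow, and when the sentence is long, the content at the beginning may be forgotten.

The Transformer is like a reader with a superpower: it takes in the whole sentence at once and also works out how each word relates to every other word. It does this with a mechanism called self-attention.

II. Self-attention: let every word look at all words

Take the sentence "Apple released a new phone, and it uses the latest chip." To understand what "it" refers to, you need to look back at the earlier words. Self-attention does exactly this: for each word in the sentence, it looks at all words (including itself) and decides which ones are most closely related to it.

How does it work? There are three roles:

Query: what the current word is looking for, i.e. which words it should attend to.
Key: a label attached to each word that tells the others what information it holds.
Value: the actual content of each word.

The computation goes like this: compare the current word's Query with the Keys of all words to get relevance scores, normalize the scores with softmax, use them to weight the Values of all words, and sum the result into a new, context-aware representation of the word.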
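
The computation just described can be sketched in a few lines of NumPy. This is a minimal single-head illustration with random weights and arbitrary sizes, not the implementation used later in these notes; the division by the square root of the key dimension is the standard scaling step.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, dim). Returns the new representations and the attention weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # relevance of every word to every other word
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, dim = 5, 8                           # 5 words, 8 features each (arbitrary)
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)               # (5, 8) (5, 5)
```

Row i of weights shows how much word i attends to each word in the sentence.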

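III. Multi-head attention: thinking from several angles

A single attention head may capture only one kind of relationship (for example, what "it" refers to). But a sentence contains many relationships, such as subject, object, and modifier. So the Transformer uses multiple heads, each learning a different relationship, and then merges their results, like asking several experts to discuss the same question.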
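
A minimal NumPy sketch of how the feature dimension is divided among heads and merged back. It shows only the reshaping; in a real multi-head layer the split is applied to the projected queries, keys, and values, and each head then runs attention on its own slice. The helper names split_heads and merge_heads are illustrative.

```python
import numpy as np

def split_heads(x, n_heads):
    """(seq_len, dim) -> (n_heads, seq_len, head_dim)."""
    seq_len, dim = x.shape
    assert dim % n_heads == 0, "dim must be divisible by n_heads"
    return x.reshape(seq_len, n_heads, dim // n_heads).transpose(1, 0, 2)

def merge_heads(x):
    """(n_heads, seq_len, head_dim) -> (seq_len, dim)."""
    n_heads, seq_len, head_dim = x.shape
    return x.transpose(1, 0, 2).reshape(seq_len, n_heads * head_dim)

x = np.arange(4 * 8, dtype=float).reshape(4, 8)   # 4 words, dim 8
heads = split_heads(x, n_heads=2)
print(heads.shape)                                # (2, 4, 4)
restored = merge_heads(heads)                     # same array as x
```

Head 0 sees features 0 to 3 of every word and head 1 sees features 4 to 7, so the total computation stays roughly the same as for a single full-width head.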

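IV. Positional encoding: giving words an order

Self-attention by itself ignores word order: "cat chases dog" and "dog chases cat" look the same to it. So each word needs a positional encoding that tells the model where in the sentence the word sits. The original Transformer uses fixed sine and cosine functions for this; learnable position vectors are a common alternative.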
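
The claim that self-attention ignores order can be checked directly. The NumPy sketch below uses made-up word vectors and identity projections (both assumptions for illustration): without positions, the output for "cat" is identical in the two sentences; once sinusoidal position encodings are added, it is not.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x):
    """Self-attention with identity projections, enough to show the ordering issue."""
    return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x

def sinusoidal_positions(seq_len, dim):
    """Fixed sine/cosine position encoding, shape (seq_len, dim); dim must be even."""
    pos = np.arange(seq_len)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

rng = np.random.default_rng(0)
cat, chases, dog = rng.normal(size=(3, 8))     # made-up word vectors
a = np.stack([cat, chases, dog])               # "cat chases dog"
b = np.stack([dog, chases, cat])               # "dog chases cat"

# Without positions: the output for "cat" is the same in both sentences
same_without = np.allclose(attention(a)[0], attention(b)[2])
# With positions added: "cat" at position 0 differs from "cat" at position 2
pe = sinusoidal_positions(3, 8)
same_with = np.allclose(attention(a + pe)[0], attention(b + pe)[2])
print(same_without, same_with)
```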

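V. Overall structure of the Transformer

The Transformer is still split into two parts, an encoder and a decoder:

Encoder: a stack of identical layers, each containing a multi-head self-attention sublayer and a feed-forward network, with a residual connection and layer normalization around each sublayer. The encoder's job is to understand the whole input sentence and produce a context-aware representation for every word.
Decoder: also a stack of identical layers, each with one more sublayer than the encoder, the cross-attention sublayer. When generating output, the decoder first applies masked self-attention (so it cannot see future words), then cross-attention over the encoder's output, and finally the feed-forward network.

VI. Key components

Residual connection: adds the sublayer's input to its output, which helps gradients flow and lets the network be deeper.
Layer normalization: standardizes the features of each word, which makes training more stable.
Feed-forward network: a simple two-layer fully connected network, applied to each position separately, that adds non-linearity to the model.
Mask: in the decoder's self-attention, hides future words so that generation can only look at earlier positions.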
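
The masking step can be sketched as follows: scores for future positions are set to negative infinity before the softmax, so their attention weights come out as exactly zero. This is a NumPy illustration with random scores, assuming a boolean mask where True means "allowed"; other implementations may use the opposite convention.

```python
import numpy as np

def causal_mask(seq_len):
    """True where attention is allowed: position i may look at positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)   # blocked positions get -inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))               # raw attention scores for 4 positions
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))                    # upper triangle is all zeros
```

The first position can attend only to itself, so its row is always [1, 0, 0, 0].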

transformer_single.py


"""
transformer_single.py
Single-file simplified Transformer implementation (all components included)
Make sure PyTorch is installed before running
"""

import torch
import torch.nn as nn
import math

# ==================== 1. Layer normalization ====================
class LayerNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_norm + self.beta

# ==================== 2. Positional encoding ====================
class PositionalEncoding(nn.Module):
    def __init__(self, dim, max_seq_len=5000):
        super().__init__()
        pe = torch.zeros(max_seq_len, dim)
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # [1, max_seq_len, dim]
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1), :]

# ==================== 3. Multi-head attention ====================
class MultiHeadAttention(nn.Module):
    def __init__(self, dim, n_heads, dropout=0.1):
        super().__init__()
        assert dim % n_heads == 0
        self.dim = dim
        self.n_heads = n_heads
        self.head_dim = dim // n_heads

        self.wq = nn.Linear(dim, dim)
        self.wk = nn.Linear(dim, dim)
        self.wv = nn.Linear(dim, dim)
        self.wo = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Linear projections, each [batch, seq_len, dim]
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        # Split into heads: [batch, n_heads, seq_len, head_dim]
        q = q.view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product scores: [batch, n_heads, q_len, k_len]
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # Positions where mask == 0 get -inf, i.e. zero weight after softmax
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = torch.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Weighted sum of values, then merge heads back to [batch, q_len, dim]
        context = torch.matmul(attn_weights, v)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.dim)
        output = self.wo(context)
        return output

# ==================== 4. Feed-forward network ====================
class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w2(self.dropout(torch.relu(self.w1(x))))

# ==================== 5. Encoder layer ====================
class EncoderLayer(nn.Module):
    def __init__(self, dim, n_heads, hidden_dim, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(dim, n_heads, dropout)
        self.attention_norm = LayerNorm(dim)
        self.feed_forward = FeedForward(dim, hidden_dim, dropout)
        self.ffn_norm = LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        _x = x
        x = self.attention(x, x, x, mask)
        x = self.attention_norm(_x + self.dropout(x))

        _x = x
        x = self.feed_forward(x)
        x = self.ffn_norm(_x + self.dropout(x))
        return x

# ==================== 6. Decoder layer ====================
class DecoderLayer(nn.Module):
    def __init__(self, dim, n_heads, hidden_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(dim, n_heads, dropout)
        self.self_attn_norm = LayerNorm(dim)
        self.cross_attention = MultiHeadAttention(dim, n_heads, dropout)
        self.cross_attn_norm = LayerNorm(dim)
        self.feed_forward = FeedForward(dim, hidden_dim, dropout)
        self.ffn_norm = LayerNorm(dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_output, src_mask, tgt_mask):
        _x = x
        x = self.self_attention(x, x, x, tgt_mask)
        x = self.self_attn_norm(_x + self.dropout(x))

        _x = x
        x = self.cross_attention(x, enc_output, enc_output, src_mask)
        x = self.cross_attn_norm(_x + self.dropout(x))

        _x = x
        x = self.feed_forward(x)
        x = self.ffn_norm(_x + self.dropout(x))
        return x

# ==================== 7. Transformer main class ====================
class Transformer(nn.Module):
    def __init__(self,
                 src_vocab_size,
                 tgt_vocab_size,
                 dim=512,
                 n_heads=8,
                 n_layers=6,
                 hidden_dim=2048,
                 max_seq_len=5000,
                 dropout=0.1):
        super().__init__()
        self.dim = dim

        self.src_embedding = nn.Embedding(src_vocab_size, dim)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, dim)
        self.pos_encoder = PositionalEncoding(dim, max_seq_len)
        self.dropout = nn.Dropout(dropout)

        self.encoder_layers = nn.ModuleList([
            EncoderLayer(dim, n_heads, hidden_dim, dropout) for _ in range(n_layers)
        ])
        self.decoder_layers = nn.ModuleList([
            DecoderLayer(dim, n_heads, hidden_dim, dropout) for _ in range(n_layers)
        ])

        self.output = nn.Linear(dim, tgt_vocab_size)
        self._init_parameters()

    def _init_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def generate_mask(self, src, tgt):
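        # Assumes the padding token id is 0; True marks positions that may be attended to
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)  # [batch, 1, 1, src_len]

        tgt_len = tgt.size(1)
        tgt_pad_mask = (tgt != 0).unsqueeze(1).unsqueeze(2)  # [batch, 1, 1, tgt_len]
        # Lower-triangular causal mask: position i may see positions 0..i
        tgt_subsequent_mask = torch.tril(torch.ones((tgt_len, tgt_len), device=tgt.device)).bool()
        # Combine the padding and causal masks via broadcasting
        tgt_mask = tgt_pad_mask & tgt_subsequent_mask.unsqueeze(0)  # [batch, 1, tgt_len, tgt_len]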
        return src_mask, tgt_mask

    def encode(self, src, src_mask):
        x = self.src_embedding(src) * math.sqrt(self.dim)
        x = self.pos_encoder(x)
        x = self.dropout(x)
        for layer in self.encoder_layers:
            x = layer(x, src_mask)
        return x

    def decode(self, tgt, enc_output, src_mask, tgt_mask):
        x = self.tgt_embedding(tgt) * math.sqrt(self.dim)
        x = self.pos_encoder(x)
        x = self.dropout(x)
        for layer in self.decoder_layers:
            x = layer(x, enc_output, src_mask, tgt_mask)
        return x

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        enc_output = self.encode(src, src_mask)
        dec_output = self.decode(tgt, enc_output, src_mask, tgt_mask)
        logits = self.output(dec_output)
        return logits

# ==================== 8. Smoke test ====================
if __name__ == "__main__":
    # Hyperparameters
    src_vocab_size = 100
    tgt_vocab_size = 100
    dim = 512
    n_heads = 8
    n_layers = 6
    hidden_dim = 2048
    max_seq_len = 50
    dropout = 0.1

    # Build the model
    model = Transformer(
        src_vocab_size,
        tgt_vocab_size,
        dim,
        n_heads,
        n_layers,
        hidden_dim,
        max_seq_len,
        dropout
    )

    # Random inputs (padding id is 0, so sample ids from 1 upward to avoid padding tokens)
    batch_size = 2
    src_len = 10
    tgt_len = 12
    src = torch.randint(1, src_vocab_size, (batch_size, src_len))
    tgt = torch.randint(1, tgt_vocab_size, (batch_size, tgt_len))

    # Forward pass
    output = model(src, tgt)

    print("Model built and forward pass completed.")
    print(f"Source input shape: {src.shape}")
    print(f"Target input shape: {tgt.shape}")
    print(f"Output shape: {output.shape}")  # expected: [2, 12, 100]

Output:

(base) PS E:\Datawhale 2026\base-llm202602> & D:/Users/app/miniconda3/envs/base-llm/python.exe "e:/Datawhale 2026/base-llm202602/04_8_transformer_single.py"

Model built and forward pass completed.
Source input shape: torch.Size([2, 10])
Target input shape: torch.Size([2, 12])
Output shape: torch.Size([2, 12, 100])

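Interpretation:

The forward pass of the Transformer ran successfully. The output shape [2, 12, 100] breaks down as:

Batch size 2: two sentences are processed at once.
Target sequence length 12: the decoder receives a 12-token target sequence and returns one prediction per position (nothing is generated autoregressively here).
Vocabulary size 100: each position gets a score (logit) for each of the 100 vocabulary entries; softmax turns these into probabilities.

This shows that the components (positional encoding, multi-head attention, encoder/decoder layers) fit together and produce tensors of the expected shape. It is only a forward-pass demo, though: the model is untrained, so the logits carry no meaning yet. To train this Transformer (for a translation task, say), you still need to add a training loop, data loading, a loss function, and so on.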
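As a rough idea of what one training step could look like, here is a minimal teacher-forcing sketch. To stay self-contained it does not import the Transformer class above; StandInModel is a hypothetical placeholder that only mimics the same call signature, model(src, tgt) -> logits. The optimizer, learning rate, and random batch are likewise assumptions for illustration.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # same padding convention as generate_mask above

class StandInModel(nn.Module):
    """Hypothetical placeholder: model(src, tgt) -> [batch, tgt_len, vocab]."""
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=PAD_ID)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, src, tgt):
        summary = self.embed(src).mean(dim=1, keepdim=True)  # crude "encoder"
        return self.out(self.embed(tgt) + summary)

def train_step(model, optimizer, criterion, src, tgt):
    """One teacher-forced update; returns the scalar loss."""
    model.train()
    decoder_input = tgt[:, :-1]   # tokens 0..T-2 are fed to the decoder
    labels = tgt[:, 1:]           # tokens 1..T-1 are what it must predict
    logits = model(src, decoder_input)
    loss = criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

torch.manual_seed(0)
vocab_size = 100
model = StandInModel(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)  # padding is excluded from the loss

src = torch.randint(1, vocab_size, (2, 10))
tgt = torch.randint(1, vocab_size, (2, 12))
losses = [train_step(model, optimizer, criterion, src, tgt) for _ in range(50)]
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The key detail is the one-position shift between decoder_input and labels: at every step the model sees the correct previous tokens and is scored on the next one. Repeating the step on a single fixed batch, as here, only checks that the loss goes down; real training would iterate over a DataLoader and track validation loss.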
