第七章 指令微调学习(五)Extracting and saving responses

第七章 指令微调学习(五)

7.7 Extracting and saving responses

在对指令数据集的训练部分完成LLM的微调后,现在评估其在保留测试集 上的性能。首先,我们提取测试集中每个输入对应的模型生成响应并进行人工分析 ;随后通过图7.18所示方法对LLM进行评估,以量化响应的质量。

1.测试集指令响应

为完成响应指令步骤,我们使用 generate 函数。随后,我们将模型的响应结果与前三个测试集条目对应的预期测试集答案并排输出,以便进行对比:

python 复制代码
torch.manual_seed(123)
for entry in test_data[:3]:
    input_text = format_input(entry)
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)

    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )
    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print(print("-------------------------------------"))

如前所述,generate函数会返回输入文本与输出文本的组合结果;因此,我们通过对generated_text内容进行切片处理并使用 .replace() 方法来提取模型的响应。
结果:

bash 复制代码
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Rewrite the sentence using a simile.

### Input:
The car is very fast.

Correct response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a cheetah.
-------------------------------------
None
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What type of cloud is typically associated with thunderstorms?

Correct response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud associated with thunderstorms is a cumulus cloud.
>### Instruction:
Name the author of 'Pride and Prejudice'.

Correct response:
>> Jane Austen.

Model response:
>> The author of 'Pride and Prejudice' is Jane Austen.
-------------------------------------

从结果可以看出,该模型表现相对良好 。首条和末条指令的答案明显正确,而第二条答案虽接近正确但并不完全准确------模型选择了"积云"而非"积雨云"。不过需要指出的是,积云确实可能发展为积雨云,而积雨云具备引发雷暴的能力。
1.最重要的是,模型评估并不像完成度微调那样简单直接:在完成度微调中,我们只需计算正确分类(垃圾邮件/非垃圾邮件)标签的比例即可得出分类准确率。

2.模型评估

在实际应用中,经过指令微调的大语言模型(LLM),会通过多种方法进行评估

(1)简答题多项选择题 基准测试(例如衡量大规模多任务语言理解能力的 MMLU ;https://arxiv.org/abs/2009.03300),用于评估模型的通用知识水平;

(2)人类对其他大语言模型(LLM)的偏好比较 ,如 LMSYS 聊天机器人竞赛平台(https://arena.lmsys.org);

(3)自动化对话基准测试 ,其中使用GPT-4等大语言模型来评估对话响应质量,例如AlpacaEval(https://tatsu-lab.github.io/alpaca_eval/)。

在实际应用中,综合考虑三种评估方法会更为有效:多项选择题作答人工评估 以及衡量对话表现的自动化指标 。然而,由于我们的主要关注点在于评估对话表现本身,而非单纯考察回答多项选择题的能力,因此人工评估和自动化指标可能更具参考价值。但人工评估耗时所以使用自动化评估

3.自动化评估

让我们采用一种受AlpacaEval启发的方法,使用另一个大语言模型来评估我们微调后的模型响应。不过,与依赖公开基准数据集不同,我们采用了自定义测试集 。这种定制化设计使得我们能够更精准、相关地评估模型在目标应用场景(即我们的指令数据集中所体现的场景)下的性能表现。为准备本次评估所需的响应数据,我们将生成的模型响应追加到test_set字典中,并将更新后的数据保存为"instruction-data-with-response.json"文件以供记录。此外,通过保存该文件,可以加载并分析这些响应。

以下代码清单沿用之前的generate方法;但此次我们遍历了整个test_set集合。同时,我们不再直接打印模型响应,而是将其添加到test_set字典中。最后输出字典中的一个条目查看是否正确添加。

python 复制代码
from tqdm import tqdm

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
    input_text = format_input(entry)
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)

    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )
    test_data[i]["model_response"] = response_text
with open("instruction-data-with-response.json", "w") as file:
    json.dump(test_data, file, indent=4)
print(test_data[0])

结果:

最后,保存模型:

python 复制代码
import re
file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL) }-sft.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")

输出:
总结:

完整代码如下:

python 复制代码
#Insturction_fine-tuning_pretrained_LLM_5_20
import json

import torch

from pre_training import calc_loss_loader
from Download_instruction_dataset5_9 import train_loader, val_loader
from Training_an_LLM_3_16 import train_model_simple
from load_pretrained_model5_20 import val_data, test_data, CHOOSE_MODEL
from load_pretrained_model5_20 import model, generate, text_to_token_ids, token_ids_to_text, BASE_CONFIG
import tiktoken

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
torch.manual_seed(123)

with torch.no_grad():
    train_loss = calc_loss_loader(
        train_loader, model, device, num_batches=5
    )
    val_loss = calc_loss_loader(val_loader, model, device,num_batches=5)
print("Training loss:", train_loss)
print("Validation loss:", val_loss)

def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = (
        f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    )
    return instruction_text + input_text


import time
start_time = time.time()
torch.manual_seed(123)
optimizer = torch.optim.AdamW(
 model.parameters(), lr=0.00005, weight_decay=0.1
)
num_epochs = 2
tokenizer = tiktoken.get_encoding("gpt2")
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context=format_input(val_data[0]), tokenizer=tokenizer
)
end_time = time.time()
execution_time_minutes = (end_time - start_time) / 60
print(f"Training completed in {execution_time_minutes:.2f} minutes.")

import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
def plot_losses(epochs_seen, tokens_seen, train_losses, val_losses):
    fig, ax1 = plt.subplots(figsize=(5, 3))
    ax1.plot(epochs_seen, train_losses, label="Training loss")
    ax1.plot(
    epochs_seen, val_losses, linestyle="-.", label="Validation loss"
    )
    ax1.set_xlabel("Epochs")
    ax1.set_ylabel("Loss")
    ax1.legend(loc="upper right")
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax2 = ax1.twiny()
    ax2.plot(tokens_seen, train_losses, alpha=0)
    ax2.set_xlabel("Tokens seen")
    fig.tight_layout()
    plt.show()
epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
plot_losses(epochs_tensor, tokens_seen, train_losses, val_losses)

#5.22
torch.manual_seed(123)
for entry in test_data[:3]:
    input_text = format_input(entry)
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)

    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )
    print(input_text)
    print(f"\nCorrect response:\n>> {entry['output']}")
    print(f"\nModel response:\n>> {response_text.strip()}")
    print(print("-------------------------------------"))

from tqdm import tqdm

for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
    input_text = format_input(entry)
    token_ids = generate(
        model=model,
        idx=text_to_token_ids(input_text, tokenizer).to(device),
        max_new_tokens=256,
        context_size=BASE_CONFIG["context_length"],
        eos_id=50256
    )
    generated_text = token_ids_to_text(token_ids, tokenizer)

    response_text = (
        generated_text[len(input_text):]
        .replace("### Response:", "")
        .strip()
    )
    test_data[i]["model_response"] = response_text
with open("instruction-data-with-response.json", "w") as file:
    json.dump(test_data, file, indent=4)
print(test_data[0])


import re
file_name = f"{re.sub(r'[ ()]', '', CHOOSE_MODEL) }-sft.pth"
torch.save(model.state_dict(), file_name)
print(f"Model saved as {file_name}")

完成了(1)生成测试集的响应;(2)并进行人工分析;(3)自动化评估。

相关推荐
这是谁的博客?5 小时前
AI 领域精选新闻(2026-05-21)
人工智能·gpt·ai·google·大模型·gemini·新闻
Bruce_Liuxiaowei7 小时前
WorkBuddy案例——自动化内容创作平台
人工智能·ai·大模型·智能体·workbuddy
刘大猫.9 小时前
GPT-5.5才发三周,5.6已在内测!OpenAI与Anthropic补贴大战同日开打,开发者坐收渔利
人工智能·ai·chatgpt·机器人·大模型·openai·anthropic
weixin_5536544810 小时前
ChatGPT好用还是Gemini好用?
人工智能·chatgpt·大模型
DogDaoDao11 小时前
【GitHub】AgentMemory 深度解析:让 AI 编程代理拥有持久化记忆的 16K+ Star 开源方案
人工智能·开源·大模型·github·aigc·ai编程·aiagent
佳杰云星1 天前
如何给大模型集群选“大脑”?智算调度与管理平台 10 维选型指南(附选型评分表)
人工智能·kubernetes·大模型·云计算·gpu·算力调度·智算中心
牧子川1 天前
016-Function-Calling
大模型·tools·functioncalling
这是谁的博客?1 天前
[模型解析] Kimi: 模型架构与长上下文能力分析
ai·大模型·kimi·长上下文·月之暗面·国产ai
这是谁的博客?1 天前
[模型解析] GPT: 模型演进分析从GPT-3到GPT-5.5
gpt·ai·chatgpt·大模型·gpt-3·openai