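Generative AI Course, Assignment 6: Training and Fine-Tuning a Large Language Model (LLM), Part 1

The material comes from Professor Hung-yi Lee's Generative AI course. If it infringes any rights, please let me know and I will take it down.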

Introduction to Generative AI 2024 Spring

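This document covers Homework 5 (GenAI HW5) of the Spring 2024 Generative AI (GenAI) course at National Taiwan University (NTU): the task overview, TODOs, decoding parameters, submission and grading, references, and appendix. Details follow: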

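Task overview
- Goal: teach an AI model to continue a Tang poem given its first two lines.
- Examples: the slides show one model that can continue a Tang poem and one that cannot.
- Method: collect poetry data and use it to teach the model to continue Tang poems.
- Output diversity: given the same first two lines, the model may generate different poems.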
TODOs
- Decide how much training data to use for fine-tuning your own large language model (LLM).
- Tune the decoding parameters to generate 15 Tang poems.
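Model and dataset
- Dataset: a Tang poem dataset of 5,000 poems. The sample code fine-tunes on 1,040 of them; you are encouraged to try more data to improve performance.
- Model: the sample code uses the TAIDE 7B model (to be released in mid-April; until then, please do not share it with anyone outside the course). You may also fine-tune a different LLM.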
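Tuning decoding parameters
- How generation works: a language model generates text by repeatedly sampling the next token from its predicted next-token distribution.
- Common parameters
  - Temperature: controls output diversity. It takes values of 0.0 and above; a higher temperature gives more diverse (more random) output.
  - Top-k: limits sampling at each step to the k most probable tokens. It is a positive integer; a larger value widens the set of candidates.
  - Top-p: samples only from the most probable tokens whose cumulative probability reaches a threshold between 0.0 and 1.0; it also controls the diversity of the output.
  - max_length: the maximum length of the generated token sequence. If a generated sentence is cut off, check this parameter.
- Defaults and tuning advice: the sample code provides default values for the decoding parameters and encourages adjusting them to generate better Tang poems.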
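To make the three sampling parameters above concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not the implementation used by the transformers library; the helper names (`filter_distribution`, `sample_next_token`) and the toy logits are made up for this example.

```python
import math
import random

def _normalize(pairs):
    """Rescale (token, weight) pairs so the weights sum to 1."""
    total = sum(p for _, p in pairs)
    return [(tok, p / total) for tok, p in pairs]

def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive when sampling")
    # Temperature: divide logits before the softmax; <1 sharpens, >1 flattens.
    peak = max(logits.values())
    weights = [(tok, math.exp((logit - peak) / temperature)) for tok, logit in logits.items()]
    ranked = sorted(_normalize(weights), key=lambda pair: pair[1], reverse=True)
    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = _normalize(ranked[:top_k])
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = _normalize(kept)
    return dict(ranked)

def sample_next_token(logits, rng=random, **kwargs):
    """Draw one token from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

# Hypothetical logits for four candidate next tokens.
toy_logits = {"春": 2.0, "秋": 1.0, "花": 0.5, "月": 0.1}
print(filter_distribution(toy_logits, temperature=0.1, top_p=0.3))
print(filter_distribution(toy_logits, temperature=1.0, top_k=2))
```

With temperature=0.1 and top_p=0.3 (the notebook's default values), only the single most probable token survives in this toy distribution, so sampling is effectively deterministic. Raising either value lets more candidates through.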
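Workflow and sample code
- The slides walk through the sample code, which runs in Google Colab: importing the required libraries, fixing random seeds, defining helper functions, loading the model and dataset, setting the fine-tuning parameters, fine-tuning, and generating poems. They explain the purpose and caveats of each step and give the expected running time of some code cells.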
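Appendix
- Enabling the GPU: enable a GPU before training to speed it up. In Google Colab, click "Runtime" (or "执行阶段"), choose "Change runtime type" (or "变更执行阶段类型"), select "T4 GPU" or another available GPU, and save.
- Choosing an adapter: you can pick the LoRA adapter saved at different stages of training and observe how the model's output changes. The sample code uses the adapter from the last step.
- A friendly reminder: this assignment is the first time in the course that students train a model. You may run into problems along the way, but the experience will help with bigger challenges later.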
Other references
- The slides cite the paper [2106.09685] LoRA: Low-Rank Adaptation of Large Language Models and give information about the TAIDE and MediaTek models.

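Per the assignment spec, here is the code link provided by the TA. Running it requires access to Google (a VPN may be needed), a Google account, and Colab.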

GenAI_hw5_LLM_finetuning.ipynb: https://colab.research.google.com/drive/1nB3jwRJVKXSDDNO-pbURrao0N2MpqHl8?usp=sharing#scrollTo=ZVVG_SQrvFpe

Below is the Google Colab code; I have run it myself and it works.

If you want to run the code locally, see my other post:

Generative AI Course, Assignment 6: Training and Fine-Tuning a Large Language Model (LLM), Part 2 (CSDN blog): https://blog.csdn.net/chenchihwen/article/details/144000079?spm=1001.2014.3001.5501

The following walks through the Python code step by step:

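Environment setup and dependency installation
- from google.colab import drive: imports Google Colab's drive module, used to mount Google Drive.
- drive.mount('/content/drive'): mounts Google Drive at /content/drive.
- The next part installs the required packages:
  - !pip install bitsandbytes==0.43.0 and so on: uses pip to install pinned versions of several packages, including bitsandbytes, datasets, and transformers.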
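Data operations
- Data preparation
  - load_dataset and load_from_disk: functions for loading datasets; load_dataset is used later to load the training data.
  - !git clone https://github.com/CheeEn-Yu/GenAI-Hw5.git: clones the dataset repository from GitHub into the current directory.
- Data-processing functions
  - generate_training_data: converts a data point (instruction, input, and output text) into tokens the model can read: input token ids, an attention mask, and output targets (labels).
  - evaluate: given an instruction, a generation config, and a maximum length, gets the model's output, decodes it, and optionally prints it.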
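The idea behind the labels built by generate_training_data can be sketched as follows: positions that belong to the prompt get the label -100, so that only the continuation contributes to the loss. This is a toy illustration under stated assumptions: `toy_tokenize` is a hypothetical character-level stand-in for the real tokenizer, the prompt template is shortened, and padding and truncation are left out.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one id per character (not the real tokenizer)."""
    return [ord(ch) for ch in text]

def build_example(instruction, user_input, output, eos="</s>"):
    """Build input ids, labels, and attention mask for one training example."""
    prompt = f"[INST] {instruction}\n{user_input}\n[/INST]"
    prompt_ids = toy_tokenize(prompt)
    full_ids = toy_tokenize(prompt + " " + output + eos)
    # Mask the prompt positions so the model is only trained on the continuation.
    labels = [IGNORE_INDEX] * len(prompt_ids) + full_ids[len(prompt_ids):]
    return {
        "input_ids": full_ids,
        "labels": labels,
        "attention_mask": [1] * len(full_ids),
    }

example = build_example("Complete the poem.", "first line", "second line")
n_masked = example["labels"].count(IGNORE_INDEX)
print(f"{n_masked} prompt positions masked out of {len(example['input_ids'])} tokens")
```

The notebook's function follows the same shape but uses the model's own tokenizer and pads every sequence to a fixed length.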
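Model operations
- Model selection and download
  - model_name = "/content/TAIDE-LX-7B-Chat": specifies the language model to use. The default is TAIDE-LX-7B-Chat; MediaTek-Research/Breeze-7B-Instruct-v0_1 is an alternative.
  - !wget -O taide_7b.zip "https://www.dropbox.com/scl/fi/harnetdwx2ttq1xt94rin/TAIDE-LX-7B-Chat.zip?rlkey=yzyf5nxztw6farpwyyildx5s3&st=s22mz5ao&dl=0" and !unzip taide_7b.zip: downloads the model archive with wget and unzips it into the current directory.
- Setup before inference
  - Configure the model and tokenizer: set the cache directory, the quantization config, and the logging level; create the tokenizer and set pad_token to eos_token.
  - Set inference parameters such as max_len and generation_config, which control how the model generates text.
  - Run inference on sample data: run the Tang poem lines in test_tang_list through the model, store the results in the demo_before_finetune list, and print them.
- Fine-tuning parameters
  - Defines a set of parameters: the amount of training data, the output directory, the checkpoint directory, the number of epochs, the learning rate, and so on. These can be adjusted to control the training process and the resulting performance.
  - Also includes other training-related settings: the cache directory, whether to load weights from a checkpoint, the dataset path, the logging interval, and the checkpoint-saving interval and limit.
- Fine-tuning
  - Preparation: depending on the from_ckpt flag, optionally load model weights from a checkpoint, then prepare the quantized model for training with prepare_model_for_int8_training.
  - Configuration and training: LoraConfig configures the LoRA adapter, and the Transformers Trainer runs the training. The training arguments include the batch size, gradient accumulation steps, and learning rate; the model's cache is disabled during training.
  - Saving: after training, the model is saved to the specified directory, and a note about possible missing-keys warnings is printed.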
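To give a feel for how small the LoRA adapter configured here is, the following sketch counts the parameters it adds. A LoRA adapter on a linear layer with input size d_in and output size d_out adds two matrices, A (r x d_in) and B (d_out x r), and scales their product by lora_alpha / r. The layer shapes below are an assumption (LLaMA-2-7B-style dimensions, which are not stated in this post); the rank, alpha, and module names come from the notebook.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank

# Assumed LLaMA-2-7B-style shapes (not stated in the post).
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32
# Values taken from the notebook.
LORA_R, LORA_ALPHA = 8, 16

# (in_features, out_features) for each module listed in TARGET_MODULES.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, LORA_R) for i, o in module_shapes.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_R  # factor applied to the low-rank update B @ A

print(f"{total:,} trainable LoRA parameters, update scaled by {scaling}")
```

Under these assumed shapes the adapter has roughly 20 million trainable parameters, well under 1% of a 7B-parameter model, which is why the fine-tuning fits on a single Colab T4.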
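Testing and results
- Loading the model and test settings
  - Find and list the available checkpoints, and pick the one to use according to id_of_ckpt_to_use.
  - Set test parameters such as max_len, temperature, and top_p, and load the adapter weights from the chosen checkpoint.
- Testing
  - Read the test data, generate a prediction for each entry, write the results to results.txt, and print each prediction.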
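The checkpoint selection can be sketched in isolation. The notebook sorts checkpoint directories by their numeric step suffix, which matters because a plain string sort would order "checkpoint-65" after "checkpoint-195". The directory listing below is hypothetical; `sort_checkpoints` is a name introduced for this example.

```python
def sort_checkpoints(names):
    """Keep 'checkpoint-<step>' entries and order them by step number."""
    ckpts = [name for name in names if name.startswith("checkpoint-")]
    return sorted(ckpts, key=lambda name: int(name.split("-")[-1]))

# Hypothetical contents of the checkpoint directory after a longer run.
listing = ["checkpoint-65", "checkpoint-195", "adapter_config.json", "checkpoint-130"]

ordered = sort_checkpoints(listing)
latest = ordered[-1]  # same effect as id_of_ckpt_to_use = -1
lexicographic_last = sorted(n for n in listing if n.startswith("checkpoint-"))[-1]

print("numeric sort picks:      ", latest)
print("plain string sort picks: ", lexicographic_last)
```

With the numeric key, index -1 is always the checkpoint from the latest training step; with a plain string sort it would be the step-65 checkpoint in this listing.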
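Submission and comparison
- Submission: the 15 poems in results.txt must be submitted to DaVinci Assistant for grading.
- Comparison: run the fine-tuned model on the same demo examples used before fine-tuning and compare the outputs to see how fine-tuning changed the model.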
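File download and references
- files.download(output_path): downloads the file at output_path (results.txt, which the notebook writes to Google Drive) to your local machine through the browser.
- [Tang Poem Dataset](https://github.com/chinese-poetry/chinese-poetry/tree/master/%E5%85%A8%E5%94%90%E8%AF%97?fbclid=IwAR2bM14S42T-VtrvMi3wywCqKfYJraBtMl7QVTo0qyPMjX9jj9Vj3JepFBA): a reference link for the dataset.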
from google.colab import drive
drive.mount('/content/drive')

"""## Install Packages
We install and import some well-written packages created by others to facilitate the fine-tuning process.

The following code block takes about 5 minutes to run, but it may vary depending on the condition of Colab.
"""

""" It is recommended NOT to change the code in this cell """

!pip install bitsandbytes==0.43.0
!pip install datasets==2.10.1
!pip install transformers==4.38.2
!pip install peft==0.9.0
!pip install sentencepiece==0.1.99
!pip install -U accelerate==0.28.0
!pip install colorama==0.4.6
!pip install fsspec==2023.9.2

"""The following code block takes about 20 seconds to run, but it may vary depending on the condition of Colab."""

""" It is recommended NOT to change the code in this cell """

import os
import sys
import argparse
import json
import warnings
import logging
warnings.filterwarnings("ignore")

import torch
import torch.nn as nn
import bitsandbytes as bnb
from datasets import load_dataset, load_from_disk
import transformers, datasets
from peft import PeftModel
from colorama import *

from tqdm import tqdm
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import GenerationConfig
from peft import (
    prepare_model_for_int8_training,
    LoraConfig,
    get_peft_model,
    get_peft_model_state_dict,
    prepare_model_for_kbit_training
)

"""## Download Dataset for Fine-tuning"""

""" It is recommended NOT to change the code in this cell """

# Download Training dataset

# reference:https://github.com/chinese-poetry/chinese-poetry/tree/master/全唐诗?fbclid=IwAR2bM14S42T-VtrvMi3wywCqKfYJraBtMl7QVTo0qyPMjX9jj9Vj3JepFBA

!git clone https://github.com/CheeEn-Yu/GenAI-Hw5.git

"""## Fix Random Seeds
There may be some randomness involved in the fine-tuning process. We fix random seeds to make the result reproducible.
"""

""" It is recommended NOT to change the code in this cell """

seed = 42
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

"""## Define Some Useful Functions"""

""" It is recommended NOT to change the code in this cell """

# Generate training data

def generate_training_data(data_point):
    """
    (1) Goal:
        - This function is used to transform a data point (input and output texts) to tokens that our model can read

    (2) Arguments:
        - data_point: dict, with field "instruction", "input", and "output" which are all str

    (3) Returns:
        - a dict with model's input tokens, attention mask that make our model causal, and corresponding output targets

    (4) Example:
        - If you construct a dict, data_point_1, with field "instruction", "input", and "output" which are all str, you can use the function like this:
            generate_training_data(data_point_1)

    """
    # construct full input prompt
    prompt = f"""\
[INST] <<SYS>>
You are a helpful assistant and good at writing Tang poem. 你是一個樂於助人的助手且擅長寫唐詩。
<</SYS>>

{data_point["instruction"]}
{data_point["input"]}
[/INST]"""
    # count the number of input tokens
    len_user_prompt_tokens = (
        len(
            tokenizer(
                prompt,
                truncation=True,
                max_length=CUTOFF_LEN + 1,
                padding="max_length",
            )["input_ids"]
        ) - 1
    )
    # transform the prompt plus the expected output (followed by the end-of-sequence marker) into tokens
    full_tokens = tokenizer(
        prompt + " " + data_point["output"] + "</s>",
        truncation=True,
        max_length=CUTOFF_LEN + 1,
        padding="max_length",
    )["input_ids"][:-1]
    return {
        "input_ids": full_tokens,
        "labels": [-100] * len_user_prompt_tokens
        + full_tokens[len_user_prompt_tokens:],
        "attention_mask": [1] * (len(full_tokens)),
    }

# Generate a response for evaluation

def evaluate(instruction, generation_config, max_len, input="", verbose=True):
    """
    (1) Goal:
        - This function is used to get the model's output given input strings

    (2) Arguments:
        - instruction: str, description of what you want model to do
        - generation_config: transformers.GenerationConfig object, to specify decoding parameters relating to model inference
        - max_len: int, max length of model's output
        - input: str, input string the model needs to solve the instruction, default is "" (no input)
        - verbose: bool, whether to print the model's output, default is True

    (3) Returns:
        - output: str, the model's response according to the instruction and the input

    (4) Example:
        - If the instruction is "ABC" and the input is "DEF" and you want model to give an answer under 128 tokens, you can use the function like this:
            evaluate(instruction="ABC", generation_config=generation_config, max_len=128, input="DEF")

    """
    # construct full input prompt
    prompt = f"""\
[INST] <<SYS>>
You are a helpful assistant and good at writing Tang poem. 你是一個樂於助人的助手且擅長寫唐詩。
<</SYS>>

{instruction}
{input}
[/INST]"""
    # convert the prompt text into the token ids the model expects
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    # generate a response with the model
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_len,
    )
    # decode the generated response, strip the <s>/</s> markers and any "Assistant" prefix, and print it
    for s in generation_output.sequences:
        output = tokenizer.decode(s)
        output = output.split("[/INST]")[1].replace("</s>", "").replace("<s>", "").replace("Assistant:", "").replace("Assistant", "").strip()
        if (verbose):
            print(output)

    return output

"""## Download model and inference before fine-tuning

The following code block takes about 10 minutes to run if you use the default setting, but it may vary depending on the condition of Colab.
"""

""" You may want (but not necessarily need) to change the LLM model """

model_name = "/content/TAIDE-LX-7B-Chat" # the model to fine-tune; the default is the TAIDE 7B model
#model_name = "MediaTek-Research/Breeze-7B-Instruct-v0_1" # to use the MediaTek Breeze 7B model instead, remove the leading "#" on this line and delete the two lines below that start with "!"

# If you want to use the TAIDE model, you should check out the TAIDE L Models Community License Agreement (https://drive.google.com/file/d/1FcUZjbUH6jr4xoCyAronN_slLgcdhEUd/view) first.

# Once you use it, it means you agree to the terms of the agreement.

!wget -O taide_7b.zip "https://www.dropbox.com/scl/fi/harnetdwx2ttq1xt94rin/TAIDE-LX-7B-Chat.zip?rlkey=yzyf5nxztw6farpwyyildx5s3&st=s22mz5ao&dl=0"

!unzip taide_7b.zip

"""## Inference before Fine-tuning
Let's first see what our model can do without fine-tuning.

The following code block takes about 2 minutes to run if you use the default setting, but it may vary depending on the condition of Colab.
"""

""" It is recommended NOT to change the code in this cell """

cache_dir = "./cache"

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# load the pretrained language model from the given model name or path

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    quantization_config=nf4_config,
    low_cpu_mem_usage = True
)

# create the tokenizer and set the end-of-sequence token (eos_token)

logging.getLogger('transformers').setLevel(logging.ERROR)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    add_eos_token=True,
    cache_dir=cache_dir,
    quantization_config=nf4_config
)
tokenizer.pad_token = tokenizer.eos_token

# set the decoding parameters used for inference

max_len = 128
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.1,
    num_beams=1,
    top_p=0.3,
    no_repeat_ngram_size=3,
    pad_token_id=2,
)

"""The following code block takes about 1 minute to run if you use the default setting, but it may vary depending on the condition of Colab."""

""" It is recommended NOT to change the code in this cell """

# demo examples

test_tang_list = ['相見時難別亦難，東風無力百花殘。', '重帷深下莫愁堂，臥後清宵細細長。', '芳辰追逸趣，禁苑信多奇。']

# get the model output for each example

demo_before_finetune = []
for tang in test_tang_list:
    demo_before_finetune.append(f'模型輸入:\n以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。{tang}\n\n模型輸出:\n'+evaluate('以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。', generation_config, max_len, tang, verbose = False))

# print the output for each example

for idx in range(len(demo_before_finetune)):
    print(f"Example {idx + 1}:")
    print(demo_before_finetune[idx])
    print("-" * 80)

"""## Set Hyperparameters for Fine-tuning

"""

""" It is highly recommended you try to play around this hyperparameter """

num_train_data = 1040 # number of training examples to use; the maximum is 5000. More data is usually better: the model sees more varied verses and generates better poems, but training also takes longer
# with the default (1040): fine-tuning takes about 25 minutes, and running all cells takes about 50 minutes
# with the maximum (5000): fine-tuning takes about 100 minutes, and running all cells takes about 120 minutes

""" You may want (but not necessarily need) to change some of these hyperparameters """

output_dir = "/content/drive/MyDrive" # output directory for the homework results (change it to store results elsewhere; a subdirectory of the default, i.e. inside Google Drive, is strongly recommended)
ckpt_dir = "./exp1" # directory for model checkpoints (change it to store checkpoints elsewhere)
num_epoch = 1 # total number of training epochs (higher means longer training; on free Colab, a long run may get disconnected)
LEARNING_RATE = 3e-4 # learning rate

""" It is recommended NOT to change the code in this cell """

cache_dir = "./cache" # cache directory
from_ckpt = False # whether to load model weights from a checkpoint; default is False
ckpt_name = None # checkpoint to load weights from when from_ckpt is True; default is None
dataset_dir = "./GenAI-Hw5/Tang_training_data.json" # path to the dataset file
logging_steps = 20 # log training metrics every this many steps
save_steps = 65 # save a checkpoint every this many steps
save_total_limit = 3 # keep at most this many checkpoints
report_to = None # where to report experiment metrics; default is None
MICRO_BATCH_SIZE = 4 # micro-batch size (examples per forward/backward pass)
BATCH_SIZE = 16 # effective batch size per optimizer step
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE # number of micro-batches whose gradients are accumulated per optimizer step
CUTOFF_LEN = 256 # maximum sequence length; longer text is truncated
LORA_R = 8 # rank r of LoRA (Low-Rank Adaptation)
LORA_ALPHA = 16 # LoRA alpha (scaling numerator)
LORA_DROPOUT = 0.05 # LoRA dropout rate
VAL_SET_SIZE = 0 # validation set size; 0 means no validation set
TARGET_MODULES = ["q_proj", "up_proj", "o_proj", "k_proj", "down_proj", "gate_proj", "v_proj"] # modules that receive LoRA adapters; the adapter weights are what gets trained and saved in checkpoints
device_map = "auto" # device map; default is "auto"
world_size = int(os.environ.get("WORLD_SIZE", 1)) # read the "WORLD_SIZE" environment variable; default is 1 if unset
ddp = world_size != 1 # use Distributed Data Parallel (DDP) only when world_size is not 1
if ddp:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)}
    GRADIENT_ACCUMULATION_STEPS = GRADIENT_ACCUMULATION_STEPS // world_size

"""## Start Fine-tuning

The following code block takes about 25 minutes to run if you use the default setting, but it may vary depending on the condition of Colab.
"""

""" It is recommended NOT to change the code in this cell """

# create the output directory you specify

os.makedirs(output_dir, exist_ok = True)
os.makedirs(ckpt_dir, exist_ok = True)

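# load model weights from a checkpoint if the from_ckpt flag is set

if from_ckpt:
    model = PeftModel.from_pretrained(model, ckpt_name)

# prepare the quantized model for training (prepare_model_for_int8_training is peft's older name for this step)

model = prepare_model_for_int8_training(model)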

# configure the LoRA adapter with LoraConfig

config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# set the tokenizer's padding token id to 0

tokenizer.pad_token_id = 0

# load and preprocess the training data

with open(dataset_dir, "r", encoding = "utf-8") as f:
    data_json = json.load(f)
with open("tmp_dataset.json", "w", encoding = "utf-8") as f:
    json.dump(data_json[:num_train_data], f, indent = 2, ensure_ascii = False)

data = load_dataset('json', data_files="tmp_dataset.json", download_mode="force_redownload")

# split the data into training and validation sets (only if VAL_SET_SIZE > 0)

if VAL_SET_SIZE > 0:
    train_val = data["train"].train_test_split(
        test_size=VAL_SET_SIZE, shuffle=True, seed=42
    )
    train_data = train_val["train"].shuffle().map(generate_training_data)
    val_data = train_val["test"].shuffle().map(generate_training_data)
else:
    train_data = data['train'].shuffle().map(generate_training_data)
    val_data = None

# train the model with the Transformers Trainer

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=50,
        num_train_epochs=num_epoch,
        learning_rate=LEARNING_RATE,
        fp16=True, # use mixed-precision training
        logging_steps=logging_steps,
        save_strategy="steps",
        save_steps=save_steps,
        output_dir=ckpt_dir,
        save_total_limit=save_total_limit,
        ddp_find_unused_parameters=False if ddp else None, # only set when running under DDP
        report_to=report_to,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# disable the model's cache

model.config.use_cache = False

# compile the model when running PyTorch 2.0 or later on a non-Windows system

if torch.__version__ >= "2" and sys.platform != 'win32':
    model = torch.compile(model)

# start training

trainer.train()

# save the trained model to the specified directory

model.save_pretrained(ckpt_dir)

# print a note about possible missing-keys warnings during training

print("\n If there's a warning about missing keys above, please disregard :)")

"""## Testing
The fine-tuning process is done. We then want to test whether our model can do the task that we wanted it to do before but failed.

We need to first load the fine-tuned model from the checkpoint we saved.
"""

""" It is recommended NOT to change the code in this cell """

# find all available checkpoints

ckpts = []
for ckpt in os.listdir(ckpt_dir):
    if (ckpt.startswith("checkpoint-")):
        ckpts.append(ckpt)

# list all the checkpoints

ckpts = sorted(ckpts, key = lambda ckpt: int(ckpt.split("-")[-1]))
print("all available checkpoints:")
print(" id: checkpoint name")
for (i, ckpt) in enumerate(ckpts):
    print(f"{i:>3}: {ckpt}")

""" You may want (but not necessarily need) to change the check point """

id_of_ckpt_to_use = -1 # id of the checkpoint to use for inference (matches the ids printed by the previous cell)
# the default -1 means the last entry in the list above, i.e. the final checkpoint
# to use a different checkpoint, replace -1 with one of the listed checkpoint ids

ckpt_name = os.path.join(ckpt_dir, ckpts[id_of_ckpt_to_use])

""" You may want (but not necessarily need) to change decoding parameters """

# You can adjust the decoding parameters here; see the homework slides for a detailed explanation of each one

max_len = 128 # maximum length of the generated response
temperature = 0.1 # randomness of the generated response; smaller values give more stable output
top_p = 0.3 # probability threshold for top-p (nucleus) sampling; controls the diversity of the response

top_k = 5 # top-k value; adjust it to increase diversity and avoid repeated words

"""The following code block takes about 2 minutes to run if you use the default setting, but it may vary depending on the condition of Colab."""

""" It is recommended NOT to change the code in this cell """

test_data_path = "GenAI-Hw5/Tang_testing_data.json"
output_path = os.path.join(output_dir, "results.txt")

cache_dir = "./cache" # cache directory
seed = 42 # random seed, for reproducible results
no_repeat_ngram_size = 3 # size of n-grams that may not repeat, to avoid repeated fragments

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# load the tokenizer for the model

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    quantization_config=nf4_config
)

# load the pretrained model with 4-bit NF4 quantization

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=nf4_config,
    device_map={'': 0}, # device to use; here GPU 0
    cache_dir=cache_dir
)

# load the fine-tuned adapter weights from the chosen checkpoint

model = PeftModel.from_pretrained(model, ckpt_name, device_map={'': 0})

"""The following code block takes about 4 minutes to run if you use the default setting, but it may vary depending on the condition of Colab."""

""" It is recommended NOT to change the code in this cell """

results = []

# set the generation config: sampling randomness, beam search, and related parameters

generation_config = GenerationConfig(
    do_sample=True,
    temperature=temperature,
    num_beams=1,
    top_p=top_p,
    # top_k=top_k,
    no_repeat_ngram_size=no_repeat_ngram_size,
    pad_token_id=2
)

# read the test data

with open(test_data_path, "r", encoding = "utf-8") as f:
    test_datas = json.load(f)

# generate a prediction for each test example and save the results

with open(output_path, "w", encoding = "utf-8") as f:
    for (i, test_data) in enumerate(test_datas):
        predict = evaluate(test_data["instruction"], generation_config, max_len, test_data["input"], verbose = False)
        f.write(f"{i+1}. "+test_data["input"]+predict+"\n")
        print(f"{i+1}. "+test_data["input"]+predict)

"""## IMPORTANT: Submit above 15 poems to DaVinci Assistant.
You can find these poems in your "results.txt". The grading of this homework will be based on the evaluation results of DaVinci Assistant on these poems.

## See how the fine-tuned model does compared to the model without fine-tuning

We now check what our model can do on the same examples we saw in the "Inference before Fine-tuning" section above.

The following code block takes about 40 seconds to run if you use the default setting, but it may vary depending on the condition of Colab.
"""

# using the same demo examples as before

test_tang_list = ['相見時難別亦難，東風無力百花殘。', '重帷深下莫愁堂，臥後清宵細細長。', '芳辰追逸趣，禁苑信多奇。']

# run inference with our fine-tuned model

demo_after_finetune = []
for tang in test_tang_list:
    demo_after_finetune.append(f'模型輸入:\n以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。{tang}\n\n模型輸出:\n'+evaluate('以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。', generation_config, max_len, tang, verbose = False))

# print the output for each example

for idx in range(len(demo_after_finetune)):
    print(f"Example {idx + 1}:")
    print(demo_after_finetune[idx])
    print("-" * 80)

"""## IMPORTANT: DO NOT submit the above 3 examples to DaVinci Assistant.
These 3 examples are only used for comparing how the model performs before and after fine-tuning.

## Download Results

You MUST have this file to finish your homework. If your browser does not download it automatically, you can find it in your Google Drive.
"""

from google.colab import files
files.download(output_path)

"""## Reference

Tang Poem Dataset
"""

Here is the output I got from running this code on a Colab T4:

{'loss': 3.336, 'grad_norm': 2.30328106880188, 'learning_rate': 0.00011999999999999999, 'epoch': 0.31}
{'loss': 1.9994, 'grad_norm': 1.6669790744781494, 'learning_rate': 0.00023999999999999998, 'epoch': 0.62}
{'loss': 1.9997, 'grad_norm': 1.8000717163085938, 'learning_rate': 9.999999999999999e-05, 'epoch': 0.92}
{'train_runtime': 5351.2028, 'train_samples_per_second': 0.194, 'train_steps_per_second': 0.012, 'train_loss': 2.4060784559983475, 'epoch': 1.0}

If there's a warning about missing keys above, please disregard :)
all available checkpoints:
id: checkpoint name
0: checkpoint-65
Example 1:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。相見時難別亦難，東風無力百花殘。

模型輸出:
] 誰言春來花自落，落花落得誰家。
--------------------------------------------------------------------------------
Example 2:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。重帷深下莫愁堂，臥後清宵細細長。

模型輸出:
] 玉牀金床無夢眠，玉筯金牖無事看。
--------------------------------------------------------------------------------
Example 3:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。芳辰追逸趣，禁苑信多奇。

模型輸出:

--------------------------------------------------------------------------------
1. 雪霽銀妝素，桔高映瓊枝。玉關春色深，金谷花氣濃。
2. 夫子何爲者？栖栖一代中。今朝不復見，誰言無道人。
3. 飛蓋去芳園，蘭橈遊翠渚。春草初生時，春鳥初鳴聲。
4. 條風開獻節，灰律動初陽。春草新綠草，春花新紅花。
5. 昨夜星辰昨夜風，畫樓西畔桂堂東。今朝春色今朝酒，誰道花落酒醉中。
6. 三日入廚下，洗手作羹湯。一朝出戶去，空留空房落。
7. 嵩雲秦樹久離居，雙鯉迢迢一紙書。玉關玉門玉關路，金谷金谷玉谷秋。
8. 慨然撫長劒，濟世豈邀名。今朝行軍去，誰言無功名。
9. 乘興南遊不戒嚴，九重誰省諫書函。玉門春色滿天開，金谷春草遍地春。
10. 猿鳥猶疑畏簡書，風雲常爲護儲胥。玉關無事空自閉，金谷無事還自開。
11. 君問歸期未有期，巴山夜雨漲秋池。月落青天空無雲，秋風秋草秋草色。
12. 相見時難別亦難，東風無力百花殘。今朝春色如故舊，誰言年華已半盡。
13. 雲母屏風燭影深，長河漸落曉星沈。玉樓秋風起，金樓秋雨落。
14. 高閣客竟去，小園花亂飛。春來無復見，秋來無容歸。
15. 瑤池阿母綺窗開，黃竹歌聲動地哀。玉女舞衣金釵落，玉馬飛雪銀輪開。
Example 1:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。相見時難別亦難，東風無力百花殘。

模型輸出:
春去秋來年復年，誰言離別無來時。
--------------------------------------------------------------------------------
Example 2:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。重帷深下莫愁堂，臥後清宵細細長。

模型輸出:
玉牀金床無夢眠，玉筯金筶無事看。
--------------------------------------------------------------------------------
Example 3:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。芳辰追逸趣，禁苑信多奇。

模型輸出:
春草初生時，花木始開期。
--------------------------------------------------------------------------------
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Takeout
/content/drive/MyDrive/Colab Notebooks
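The numbers in the training log above can be cross-checked against the hyperparameters. The sketch below recomputes the number of optimizer steps, the epoch fraction at each logging step, and the learning rate, assuming the Trainer's default linear schedule (linear warmup followed by linear decay to zero). The constants are copied from the notebook; `linear_schedule_lr` is a helper written for this example, not a library function.

```python
import math

# Hyperparameters copied from the notebook.
NUM_TRAIN_DATA, BATCH_SIZE, MICRO_BATCH_SIZE = 1040, 16, 4
NUM_EPOCH, LEARNING_RATE, WARMUP_STEPS, LOGGING_STEPS = 1, 3e-4, 50, 20

grad_accum = BATCH_SIZE // MICRO_BATCH_SIZE             # micro-batches per optimizer step
steps_per_epoch = math.ceil(NUM_TRAIN_DATA / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCH

def linear_schedule_lr(step):
    """Linear warmup to the peak rate, then linear decay to zero (assumed schedule)."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / max(1, total_steps - WARMUP_STEPS)

print("gradient accumulation steps:", grad_accum)
print("total optimizer steps:", total_steps)
for step in range(LOGGING_STEPS, total_steps + 1, LOGGING_STEPS):
    epoch = round(step / total_steps, 2)
    print(f"step {step}: epoch {epoch}, learning rate {linear_schedule_lr(step):.2e}")
```

Under these assumptions the run has 65 optimizer steps, which matches save_steps = 65 and explains why checkpoint-65 is the only checkpoint listed. The recomputed learning rates at steps 20, 40, and 60 (1.2e-4, 2.4e-4, and 1.0e-4) and the epoch fractions (0.31, 0.62, 0.92) agree with the three logged lines.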
