Fine-Tuning Llama 3 8B for Chinese Locally with Unsloth

Hardware requirements

Nvidia GPU (8 GB of VRAM or more recommended)
Devices the author currently uses:
- Nvidia RTX 3060 12G
- P106 6G

Environment setup

Python version: 3.11.0
CUDA toolchain: NVCC (CUDA Toolkit), build cuda_12.8.r12.8/compiler.35404655_0

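Installing PyTorch

I ran into a lot of pitfalls during installation. In practice, the conda-forge channel ships a prebuilt PyTorch package, so it can be installed directly with the following command: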

conda install conda-forge::pytorch

This pulls a preconfigured PyTorch build from the conda-forge channel.
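
Before moving on, it can help to confirm that the installed PyTorch build actually sees the GPU. The following is a minimal sketch rather than part of the original setup; the helper name describe_torch_cuda is hypothetical, and it only reports what PyTorch itself exposes.

```python
def describe_torch_cuda():
    """Return a small report about the installed PyTorch build and its CUDA support."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_build": None,       # CUDA version PyTorch was built against
        "cuda_available": False,  # whether a usable GPU is visible at runtime
        "device_name": None,
    }
    try:
        import torch
    except ImportError:
        return report  # PyTorch is not installed in this environment

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = bool(torch.cuda.is_available())
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in describe_torch_cuda().items():
        print(f"{key}: {value}")
```

If cuda_available comes back False, the installed package is probably a CPU-only build or the driver is not visible, and the training scripts later in this post would not be able to use the GPU.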

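Installing Unsloth

Storage requirements

Make sure the disk that holds Unsloth has more than 120 GB of space and fast read/write speeds.
For example, create an unsloth folder under D:\ai.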
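
The storage requirement can be checked from Python before anything is downloaded. This is a rough sketch; it interprets the 120 GB figure as free space, which is an assumption, and the function names are made up for illustration.

```python
import shutil

REQUIRED_GB = 120  # figure from the text, interpreted here as free space (assumption)


def free_space_gb(path):
    """Return the free space, in GiB, of the disk that holds `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_enough_space(path, required_gb=REQUIRED_GB):
    """True when the disk holding `path` has at least `required_gb` GiB free."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    target = "."  # replace with the intended working directory
    print(f"Free space at {target!r}: {free_space_gb(target):.1f} GiB")
    print("Enough space:", has_enough_space(target))
```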

Pulling the Unsloth code

git clone https://github.com/unslothai/unsloth.git

The code is cloned into D:\ai\unsloth.

Installing dependencies

In the VSCode terminal (PowerShell), make sure the Conda environment with Python 3.11.0 and conda-forge::pytorch is active, then run:

cd D:\ai\unsloth
pip install unsloth

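Once the installation finishes, the basic environment is ready. Note that pip install unsloth installs the released package from PyPI rather than the cloned source; the cloned folder is used as the working directory in the later steps.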
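
A quick way to see what pip actually installed is to list the versions of the packages the scripts in this post rely on. This is a sketch under assumptions: the package list is my guess at the relevant distributions (the scripts import unsloth, torch, transformers, trl and datasets; peft and bitsandbytes are included because LoRA and 4-bit loading typically depend on them), and a Conda-installed PyTorch may register under a different distribution name.

```python
from importlib import metadata

# Assumed list of relevant distributions; adjust to taste.
PACKAGES = ["unsloth", "torch", "transformers", "trl", "datasets", "peft", "bitsandbytes"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```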

Testing the installation

Run the following test script to check whether the installation succeeded.

import multiprocessing
import warnings
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

warnings.filterwarnings("ignore", message="expandable_segments not supported on this platform")

if __name__ == '__main__':
    multiprocessing.freeze_support()
    
    max_seq_length = 2048
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Llama-3.2-1B",
        max_seq_length=max_seq_length,
        load_in_4bit=True,
    )

    dataset = load_dataset("json", data_files={"train": "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"}, split="train")

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        tokenizer=tokenizer,
        args=SFTConfig(
            dataset_text_field="text",
            max_seq_length=max_seq_length,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=4,
            warmup_steps=10,
            max_steps=1,
            logging_steps=1,
            output_dir="outputs",
            optim="adamw_8bit",
            seed=3407,
        ),
    )

    trainer.train()

This script pulls a small model from the Hugging Face servers and trains it for one step to verify that the environment is configured correctly.

Preparing for training

Creating a storage directory

On a large, fast disk, create a model or ai directory, for example:

mkdir D:\ai\model

Pulling the LLaMA3 8B Instruct model (the base for Chinese fine-tuning)

git clone https://www.modelscope.cn/LLM-Research/Meta-Llama-3-8B-Instruct.git

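Note: in practice, the LLaMA3 8B model downloaded here has limited Chinese ability and still answers mostly in English in deeply specialized domains.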
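
One thing worth checking after a git clone of a model repository is that the weight files are real data and not Git LFS pointer stubs, which is what you get when Git LFS is not installed. The sketch below relies on the fact that an LFS pointer file begins with a fixed "version https://git-lfs.github.com/spec/v1" line; the *.safetensors pattern and the function name are assumptions for illustration.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir, pattern="*.safetensors"):
    """Return the names of weight files that are still LFS pointer stubs."""
    stubs = []
    for path in sorted(Path(model_dir).glob(pattern)):
        with open(path, "rb") as handle:
            head = handle.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path.name)
    return stubs


if __name__ == "__main__":
    model_dir = "."  # replace with the cloned model directory
    stubs = find_lfs_pointers(model_dir)
    print("Pointer stubs found:" if stubs else "No pointer stubs found.", stubs)
```

If any stubs are reported, installing Git LFS and running git lfs pull inside the cloned directory should fetch the actual weights.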

Fine-tuning on custom data

To improve Chinese conversational ability, a GPT-4o multi-turn dialogue dataset is used:

https://huggingface.co/datasets/IndexG/A_small_of_gpt4o_sessions_for_indexguc
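
The training script in the next section reads a local converted_data.json and expects each record to carry a "conversations" list of turns with "role" and "content" keys. That schema is inferred from the script's formatting function, not from the dataset page, and the conversion step itself is not shown in this post. As a sketch, a small validator like the following could catch malformed records before training; it assumes the file is a single JSON array, and all names are hypothetical.

```python
import json

ALLOWED_ROLES = {"user", "assistant"}  # the only roles the training script handles


def validate_record(record):
    """Return a list of problems in one record; an empty list means it looks fine."""
    if not isinstance(record, dict):
        return ["record is not an object"]
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["'conversations' must be a non-empty list"]
    problems = []
    for index, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {index} is not an object")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {index} has unexpected role {turn.get('role')!r}")
        if not isinstance(turn.get("content"), str):
            problems.append(f"turn {index} has no text content")
    return problems


def validate_file(path):
    """Map record index -> problems for every bad record in a JSON-array file."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    report = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            report[index] = problems
    return report
```

Calling validate_file("./converted_data.json") returns an empty dict when every record matches the assumed schema. Turns with any other role (for example "system") are flagged because the formatting function in the training script silently skips them.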

Training code

import torch
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
import warnings

# Suppress the "expandable_segments not supported" warning
warnings.filterwarnings("ignore", message="expandable_segments not supported on this platform")

if __name__ == '__main__':
    # Load the dataset (a JSON file whose records carry a "conversations" list)
    dataset = load_dataset("json", data_files="./converted_data.json", split="train")

    # Load the base model in 4-bit; point model_name at the local model directory
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="H:/ai/Meta-Llama-3-8B-Instruct",
        max_seq_length=2048,
        dtype=torch.float16,
        load_in_4bit=True
    )

    # Convert each conversation into a single training string
    def formatting_prompts_func(examples):
        # Handle both single-sample and batched calls
        conv_text = []

        # Detect the data layout (compatible with old and new dataset loading behavior)
        if isinstance(examples["conversations"][0], list):
            conversations_list = examples["conversations"]
        else:
            conversations_list = [examples["conversations"]]
        
        for conversation in conversations_list:
            text = ""
            for turn in conversation:
                if turn["role"] == "user":
                    text += "<|user|>\n{content}</s>\n".format(content=turn["content"])
                elif turn["role"] == "assistant":
                    text += "<|assistant|>\n{content}</s>\n".format(content=turn["content"])
            conv_text.append(text)
        
        return conv_text

    # Configure the LoRA parameters
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
        lora_dropout=0,
        bias="none",
        use_gradient_checkpointing="unsloth",
        random_state=3407,
        use_rslora=False,
        loftq_config=None
    )

    # Training hyperparameters
    training_args = TrainingArguments(
        output_dir="models/lora/llama",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=50,  # number of training steps
        logging_steps=10,
        save_strategy="steps",
        save_steps=10,  # save a checkpoint every 10 steps
        learning_rate=2e-4,
        fp16=True,  # True, to match the model's float16 precision
        bf16=False,  # False, to avoid a conflict with fp16
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407
    )

    # Create the trainer
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=dataset,
        max_seq_length=2048,
        dataset_num_proc=2,
        packing=False,
        formatting_func=formatting_prompts_func,  # use the formatting function defined above
    )

    # Print run information
    print(f"Dataset size: {len(dataset)} samples.\nTraining steps: {training_args.max_steps}, output directory: {training_args.output_dir}.")

    # Show the current memory state
    gpu_stats = torch.cuda.get_device_properties(0)
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

    print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
    print(f"Reserved memory = {start_gpu_memory} GB.")

    # Run training
    trainer_stats = trainer.train()

    # Show final memory and time statistics
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
    used_percentage = round(used_memory / max_memory * 100, 3)
    lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

    print(f"Training time = {trainer_stats.metrics['train_runtime']} seconds.")
    print(f"Training time = {round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes.")
    print(f"Peak reserved memory = {used_memory} GB.")
    print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
    print(f"Peak reserved memory as a share of max memory = {used_percentage} %.")
    print(f"Peak reserved memory for training as a share of max memory = {lora_percentage} %.")

    # Save the LoRA adapter
    lora_model = "models/llama_lora"
    model.save_pretrained(lora_model)
    tokenizer.save_pretrained(lora_model)

    print("Model saved to", lora_model)

    # Save the full merged model
    model.save_pretrained_merged("models/Llama3", tokenizer, save_method="merged_16bit")

    # Save in GGUF format
    # TODO: this step had problems, so it is disabled for now
    """
    To convert the model to F16 for inference, run the command below.
    The script lives in the llama.cpp folder under the current directory.
    python d:/ai/unsloth/llama.cpp/convert_hf_to_gguf.py D:/ai/unsloth/models/Llama3/  --outfile H:/ai/Meta-Llama-3-8B-InstructGGUF/llama3-8b-instruct.f16.gguf --outtype f16

    Convert the original model as well, to compare it against the LoRA result:
    python d:/ai/unsloth/llama.cpp/convert_hf_to_gguf.py H:/ai/Meta-Llama-3-8B-Instruct  --outfile H:/ai/Meta-Llama-3-8B-InstructO1/llama3-8b-instruct.f16.gguf --outtype f16

    """
    # try:
    #     model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")
    # except RuntimeError as e:
    #     print("Error encountered:", e)
    #     print("llama.cpp needs to be compiled manually to run the conversion.")
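
The formatting function in the script above wraps turns in <|user|> / <|assistant|> markers closed by </s>. Llama 3 Instruct checkpoints were trained with a different layout built from <|start_header_id|>, <|end_header_id|> and <|eot_id|> tokens, so the script's markers may not match what the base model expects. The following is a sketch of an alternative formatting function that follows the Llama 3 layout; it is not the code used for the results in this post, and the function names are hypothetical.

```python
def format_llama3_conversation(conversation, add_bos=False):
    """Render a list of {"role", "content"} turns in the Llama 3 Instruct layout.

    add_bos is off by default on the assumption that the tokenizer prepends
    <|begin_of_text|> itself; turn it on if yours does not.
    """
    text = "<|begin_of_text|>" if add_bos else ""
    for turn in conversation:
        if turn["role"] not in ("system", "user", "assistant"):
            continue  # skip roles this sketch does not handle
        text += (
            f"<|start_header_id|>{turn['role']}<|end_header_id|>\n\n"
            f"{turn['content'].strip()}<|eot_id|>"
        )
    return text


def formatting_prompts_func_llama3(examples):
    """Drop-in style formatter: accepts one sample or a batch, returns a list of strings."""
    conversations = examples["conversations"]
    if conversations and isinstance(conversations[0], dict):
        conversations = [conversations]  # a single sample was passed
    return [format_llama3_conversation(conv) for conv in conversations]
```

In practice, tokenizer.apply_chat_template(conversation, tokenize=False) from the transformers library produces the model's own template and is usually the safer choice, since it reads the layout from the checkpoint instead of hard-coding it.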

Training results

At 100 steps, the model showed overfitting and mixed languages in its output.
After reducing to 50 steps, the model performed well.
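
To put the step counts in perspective, it helps to work out how many samples the trainer consumes. With per_device_train_batch_size=2 and gradient_accumulation_steps=4 from the training script, each optimizer step covers 8 samples. The sketch below does that arithmetic; the dataset size of 400 is a made-up figure for illustration only, since the post does not state how many records the dataset contains.

```python
def samples_seen(max_steps, per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Total number of training samples consumed after max_steps optimizer steps."""
    return max_steps * per_device_batch_size * gradient_accumulation_steps * num_devices


def approx_epochs(max_steps, dataset_size, per_device_batch_size, gradient_accumulation_steps):
    """Roughly how many passes over the dataset max_steps corresponds to."""
    seen = samples_seen(max_steps, per_device_batch_size, gradient_accumulation_steps)
    return seen / dataset_size


if __name__ == "__main__":
    hypothetical_dataset_size = 400  # illustrative only, not the real dataset size
    for steps in (50, 100):
        seen = samples_seen(steps, 2, 4)
        epochs = approx_epochs(steps, hypothetical_dataset_size, 2, 4)
        print(f"{steps} steps -> {seen} samples, about {epochs:.1f} epochs")
```

On a small dataset, doubling the step count means each example is repeated more often, which is one plausible reason overfitting showed up at 100 steps.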

Converting the model format

python d:/ai/unsloth/llama.cpp/convert_hf_to_gguf.py D:/ai/unsloth/models/Llama3/ --outfile H:/ai/Meta-Llama-3-8B-InstructGGUF/llama3-8b-instruct.f16.gguf --outtype f16
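
After the conversion finishes, a cheap sanity check is to read the file header: a GGUF file begins with the four ASCII bytes "GGUF" followed by a little-endian 32-bit version number. The sketch below checks only that much and does not validate the tensors; the function name is hypothetical.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF format version of the file, or None if it is not a GGUF file."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None  # too short or wrong magic bytes
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version
```

Calling read_gguf_version on the --outfile path from the command above should return an integer; None suggests the conversion did not produce a valid file.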

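Summary:

LoRA fine-tuning of LLaMA3 8B completed successfully.
After 50 training steps, the model performed noticeably better on multi-turn Chinese conversation tasks.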