基于自己数据微调LLama3并本地化部署

准备数据

这是一条数据，格式如下：

复制代码

{"instruction": "胡女士出现了黄疸、恶心、呕吐等症状，可能患了什么疾病？", "input": "", "output": "胡女士可能患有胆道张力低下综合征。建议尽快到内科进行检查，并进行西医和中医治疗。"}

可以基于自己数据处理为以下格式，也可以下载别人处理好的格式，数据集处理和数据集下载可以看看（github，huggingface），也可以使用ChatGPT生成。

之后在huggingface上传数据集如下

准备微调

https://github.com/unslothai/unsloth

这是一个开源免费的微调库，使用unsloth微调大模型，8G显存即可微调大模型，推理速度提升四倍，内存减少80%，在经过llama.cpp可以量化为4bit，不仅GPU，CPU也可本地推理。

在线微调

使用colab可以白嫖云端服务器进行微调本地大模型

我们来微调Llama3，点击第一个，colab使用教程

大体与官方教程一致，简单改了一下

https://colab.research.google.com/drive/1TqdSFm7qEv8Dh4c04LGV7z3vFjVLqxhR?usp=drive_link

根据新旧GPU安装依赖包

python 复制代码

#1安装微调库
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
# 由于Colab有torch 2.2.1，会破坏软件包，要单独安装
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    # 新GPU，如Ampere、Hopper GPU（RTX 30xx、RTX 40xx、A100、H100、L40）。
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    # 较旧的GPU（V100、Tesla T4、RTX 20xx）
    !pip install --no-deps trl peft accelerate bitsandbytes
    !pip install xformers==0.0.25  #最新的0.0.26不兼容
pass

选择大模型基于微调（名字替换即可）

https://huggingface.co/unsloth

python 复制代码

#2加载模型
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

加载数据集

python 复制代码

#4准备微调数据集
EOS_TOKEN = tokenizer.eos_token # 必须添加 EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # 必须添加EOS_TOKEN，否则无限生成
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("TianFeng668/Llama3-med-data", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

设置训练参数（基于lora微调，r为秩，一般越高越好，取决显存大小，学习率看loss适当调整，训练步数60可以适当加大）

python 复制代码

#5设置训练参数
from trl import SFTTrainer
from transformers import TrainingArguments
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, #  建议 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth", # 检查点，长上下文度
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # 可以让短序列的训练速度提高5倍。
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,  # 微调步数
        learning_rate = 2e-4, # 学习率
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

开始训练

复制代码

#6开始训练
trainer_stats = trainer.train()

保存lora

python 复制代码

#8保存LoRA模型
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # 在线保存到hugging face，需要token

llama.cpp合并模型并量化成4位gguf保存，分三步：

Unsloth: $0$ 安装 llama.cpp。这将需要 3 分钟...

Unsloth： $1$ 将 model 处的模型转换为 f16 GGUF 格式。输出位置将为 ./model-unsloth.F16.gguf

这将需要 3 分钟...

Unsloth：转换完成！输出位置：./model-unsloth.F16.gguf

Unsloth： $2$ 将 GGUF 16bit 转换为 q4_k_m。这将需要 20 分钟...

python 复制代码

#9合并模型并量化成4位gguf保存
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
#model.save_pretrained_merged("outputs", tokenizer, save_method = "merged_16bit",) #合并模型，保存为16位hf
#model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "") #合并4位gguf，上传到hugging face(需要账号token)

挂载谷歌云盘

python 复制代码

#10挂载google drive
from google.colab import drive
drive.mount('/content/drive')

复制模型到google drive（Llama3是云盘创建的文件夹）

python 复制代码

#11复制模型到google drive
import shutil
source_file = '/content/model-unsloth.Q4_K_M.gguf'
destination_dir = '/content/drive/MyDrive/Llama3'
destination_file = f'{destination_dir}/model-unsloth.Q4_K_M.gguf'
shutil.copy(source_file, destination_file)

本地可视化部署

详看上篇文章

open-webui+ollama本地部署Llama3

选一种就可以，最简单就下载GPT4all，把模型放入C:\Users\admin\AppData\Local\nomic.ai\GPT4All

本地Windows训练

复制代码

Windows本地部署条件
1、Windows10/Windows11
2、英伟达卡8G显存、16G内存，安装CUDA12.1、cuDNN8.9，C盘剩余空间20GB、unsloth安装盘40GB
3、依赖软件：CUDA12.1+cuDNN8.9（或cuda11.8+8.9）、Python11.9、Git、Visual Studio 2022、llvm(可选）
4、HuggingFace账号，上传训练数据集
5、所需文件网盘：链接: https://pan.baidu.com/s/1AhpOX9MnEjpj88jzvIwUmA 提取码: vhea 
VisualStudioSetup（勾选c++桌面开发） git安装  重启

可以使用anaconda管理包和环境，安装https://tianfeng.space/399.html#23（cuda安装，cudnn安装）

python 复制代码

# 创建虚拟环境 激活虚拟环境
conda create --name unsloth_env python=3.11
conda activate unsloth_env
# 安装依赖包
pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
pip install xformers
# Windows下triton，deepspeed需要编译，网盘下载在unsloth目录下，切换到unsloth目录
pip install deepspeed-0.13.1+unknown-py3-none-any.whl
pip install triton-2.1.0-cp311-cp311-win_amd64.whl
# 测试安装是否成功（正常返回信息）
nvcc --version
python -m xformers.info
python -m bitsandbytes
# 运行脚本
test-unlora.py 测试微调之前推理
fine-tuning.py 用数据集微调
test-lora.py 测试微调之后推理
save-16bit.py 合并保存模型16位
save-gguf-4bit.py 4位量化gguf格式
# 若本地运行fine-tuning.py出错，出现gcc.exe无法编译，可以尝试下载llvm-windows-x64.zip解压，在系统环境变量path路径里添加llvm下的bin路径

量化

fine-tuning.py修改注释，之前是保存为16位hf格式，现在量化4位gguf

python 复制代码

4位量化需要安装llama.cpp，步骤如下：
#1、克隆到本地（切换到unsloth目录）
git clone https://github.com/ggerganov/llama.cpp
#2、按官方文档编译
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
#3、设置Visual Studio 2022中cmake路径到系统环境变量path里
C:\Program Files\Microsoft Visual Studio\2022\Professional\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin
C:\Program Files\Microsoft Visual Studio\2022\Professional
#4、编译llama.cpp
cmake --build . --config Release
#5、如果上面这句编译命令无法执行，需要做以下操作：
复制这个路径下的C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\extras\visual_studio_integration\MSBuildExtensions
4个文件，粘贴到以下目录里
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\BuildCustomizations
#6、编译好以后，把llama.cpp\build\bing\release目录下的所有文件复制到llama.cpp目录下
#7、重新运行fine-tuning.py微调保存