Customizing Your Own Large Model: Fine-Tune an AI Model of Your Own in Just a Few Lines of Code (Fine-Tuning DeepSeek-R1 with unsloth)

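Preface

The release of strong models such as DeepSeek and QwQ marks China's arrival among the world leaders in large AI models. More and more fields are paying attention to large models and putting them to use, and many industries are building models for their own domains. For example, the "法衡-R1" legal model from Southeast University and the "华佗" medical diagnosis model from Harbin Institute of Technology have both performed well. So how do they turn a generalist model into a domain expert? The answer is fine-tuning.

The previous post, "Customizing Your Own Large Model: Fine-Tuning Qwen with llama-factory", introduced the llamafactory WebUI, which lets you fine-tune a model without writing a single line of code. Today's post covers unsloth, a fine-tuning tool as well known as llamafactory. Using a medical dataset with chain-of-thought (CoT) reasoning, we fine-tune DeepSeek-R1 in just a few lines of code. Let's take a look.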

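1. Environment Setup

I won't repeat the concepts behind fine-tuning here; if you are interested, see my earlier post "A Slimming Guide for Large Models". unsloth needs a working GPU environment before you can fine-tune. For the setup steps, see "Customizing Your Own Large Model: Fine-Tuning Qwen with llama-factory"; I won't repeat those either.

We use anaconda to set up the Unsloth fine-tuning environment directly.

Create a virtual environment with anaconda, with the Python version set to 3.12 and the environment named unsloth_gpu: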

conda create -n unsloth_gpu python=3.12

Activate the new environment with conda activate unsloth_gpu, then install unsloth with the following commands:

pip install unsloth
pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
pip install datasets  # library used to load the dataset

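After installation, check that PyTorch and Unsloth were installed correctly. If torch.cuda.is_available() prints True and the unsloth import raises no error, the installation succeeded.

# Check that the GPU build of PyTorch is installed
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())

# Check that unsloth is installed
from unsloth import FastLanguageModel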

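Some readers end up with the CPU build of PyTorch. unsloth only supports fine-tuning on a GPU (llamafactory supports both GPU and CPU), so in that case you also need to install the GPU build of PyTorch. The command below pulls the CUDA 12.4 wheels; use the index URL that matches your CUDA version: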

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

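2. unsloth Quick Start

2.1 Downloading the Model

The first step in fine-tuning is to download the model. Here we train DeepSeek-R1-Distill-Llama-8B, downloaded from the ModelScope website. First run pip install modelscope to install the modelscope tool, then run modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --local_dir ./DeepSeek-R1-Distill-Llama-8B. The --local_dir option places the files in a DeepSeek-R1-Distill-Llama-8B folder under the current directory; without it, the tool stores them in the ModelScope cache directory.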

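2.2 Inference with unsloth

We need to write code that calls the unsloth framework to run inference (that is, to ask the model a question). The model is stored in the DeepSeek-R1-Distill-Llama-8B directory.

Import the key class from unsloth: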

from unsloth import FastLanguageModel

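Set the key parameters:

max_seq_length = 2048  # maximum sequence length (prompt plus response) the model handles
dtype = None  # use the model's default precision; torch.float16 or torch.bfloat16 also work
load_in_4bit = False  # whether to load a 4-bit quantized model to reduce GPU memory use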

Load the pretrained model:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./DeepSeek-R1-Distill-Llama-8B",  # local path of the downloaded model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

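You can run print(model) to see a summary of the Llama-8B structure: it consists of 32 Transformer decoder layers.

print(tokenizer) shows the tokenizer. The tokenizer converts input text into a list of token IDs, which the model then turns into vectors (embeddings). As an example, suppose we have this vocabulary:

{
    "我们": 0,  # "we"
    "爱": 1,  # "love"
    "可爱": 2,  # "cute"
    "但是": 3,  # "but"
    "科学": 4,  # "science"
    ...
}

When the user enters 【我们爱科学】 ("we love science"), the tokenizer converts it into the Python list [0, 1, 4], which is then fed to the model for embedding. The printed tokenizer reports a vocabulary size of 128000 for Llama-8B, meaning it can recognize 128000 distinct tokens (subword units rather than whole words).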
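To make the lookup concrete, here is a minimal sketch of a greedy longest-match tokenizer over the toy vocabulary above. The names `toy_vocab` and `toy_encode` are made up for illustration. Real tokenizers, including the one shipped with this model, work on learned subword units, so this only mirrors the idea of mapping text to IDs.

```python
# Toy vocabulary from the example above (hypothetical, for illustration only).
toy_vocab = {"我们": 0, "爱": 1, "可爱": 2, "但是": 3, "科学": 4}


def toy_encode(text, vocab):
    """Greedy longest-match lookup: at each position take the longest vocab entry."""
    max_len = max(len(token) for token in vocab)
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += size
                break
        else:
            raise ValueError(f"no vocabulary entry matches text at position {pos}")
    return ids


print(toy_encode("我们爱科学", toy_vocab))  # [0, 1, 4]
```

A real tokenizer never raises on unknown text because it can always fall back to smaller units; the toy version raises instead to keep the sketch short.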

Switch the model to inference mode:

FastLanguageModel.for_inference(model)

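Now we can talk to the model. Take the question "How can we prove that the square root of 2 is irrational?". The question first goes through the tokenizer, which converts it to token IDs:

question = "How can we prove that the square root of 2 is irrational?"
inputs = tokenizer([question], return_tensors="pt").to("cuda")
print(inputs)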

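Pass the resulting input_ids to the model to generate an answer:

outputs = model.generate(
    input_ids=inputs.input_ids,
    max_new_tokens=1200,  # the output also contains the prompt; this caps the answer at 1200 newly generated tokens
    use_cache=True,  # reuse cached attention keys/values to speed up generation
)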

The generated outputs are also token IDs; print(outputs) shows them.

Convert outputs back to text and print it. The decoded text contains the original question followed by the model's reply.

response = tokenizer.batch_decode(outputs)
print(response[0])
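Because the generated sequence starts with the prompt tokens, decoding only the reply means slicing off the first len(prompt) IDs. The sketch below shows that slicing on plain Python lists with made-up IDs; `split_generated` is a hypothetical helper, not part of unsloth. With the real tensors, the equivalent slice would be `outputs[:, inputs.input_ids.shape[1]:]`.

```python
def split_generated(prompt_ids, output_ids):
    """Return only the newly generated IDs, assuming output_ids starts with prompt_ids."""
    if output_ids[:len(prompt_ids)] != prompt_ids:
        raise ValueError("output does not start with the prompt")
    return output_ids[len(prompt_ids):]


# Made-up token IDs, for illustration only.
prompt_ids = [128000, 15225, 9554]
output_ids = prompt_ids + [791, 6425, 374, 128001]

new_ids = split_generated(prompt_ids, output_ids)
print(new_ids)  # [791, 6425, 374, 128001]
```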

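2.3 Templated Output with unsloth

The above was a bare question to DeepSeek-R1. Good conversational code uses structured input and output, so next we simulate an everyday question-and-answer scenario.

Structured input template:

prompt_style_chat = """Write an appropriate response to complete the current conversation task.

### Instruction:
You are a helpful assistant.

### Question:
{}

### Response:
<think>{}"""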

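Describe the question and embed it in the template:

question = "Please prove that the square root of 2 is irrational."
# str.format puts question into the first {} (the Question slot) and an empty string into the second {} after <think>
inputs = tokenizer([prompt_style_chat.format(question, "")], return_tensors="pt").to("cuda")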

Feed the prompt to the model and generate an answer. The model now replies in the structured format:

outputs = model.generate(
    input_ids=inputs.input_ids,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response)

To extract the answer from the structured output, print(response[0].split("### Response:")[1]) is all you need.
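If you also want the reasoning and the final answer as separate strings, the split can go one step further. The following is a minimal sketch that assumes the decoded text follows the template above, with the reasoning closed by `</think>`; the helper name `parse_response` and the sample text are made up for illustration.

```python
def parse_response(decoded, eos_token=None):
    """Split a decoded generation into (reasoning, answer).

    Assumes everything after '### Response:' looks like
    '<think>reasoning</think>answer'.
    """
    body = decoded.split("### Response:", 1)[1]
    if eos_token:
        body = body.replace(eos_token, "")
    reasoning, sep, answer = body.partition("</think>")
    if not sep:
        # The reasoning was never closed (for example, max_new_tokens was reached).
        return body.replace("<think>", "").strip(), ""
    return reasoning.replace("<think>", "").strip(), answer.strip()


sample = "### Question:\nIs 2 even?\n\n### Response:\n<think>2 = 2 * 1.</think>\nYes, 2 is even."
reasoning, answer = parse_response(sample)
print(reasoning)  # 2 = 2 * 1.
print(answer)     # Yes, 2 is even.
```

Passing `tokenizer.eos_token` as `eos_token` removes the end-of-sequence marker that batch_decode leaves in the text.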

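2.4 Evaluating the Original Model

Training updates the parameters of the loaded model in place, so we need to test the original model before fine-tuning.

We define a new prompt template. The dataset used later, medical-o1-reasoning-SFT, is an all-English medical diagnosis dataset, so the template is written in English.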

prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>{}"""

This prompt applies several prompt-writing techniques; if you are interested, see my post "Essential Elements and Basic Techniques of Prompts".

Next, pick a question for the evaluation. The question used here is:

question_1 = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"

Run the test: insert the question into the template, ask the model, and print the answer:

inputs1 = tokenizer([prompt_style.format(question_1, "")], return_tensors="pt").to("cuda")


outputs1 = model.generate(
    input_ids=inputs1.input_ids,
    max_new_tokens=1200,
    use_cache=True,
)

response1 = tokenizer.batch_decode(outputs1)

print(response1[0].split("### Response:")[1])

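In paraphrase, the original model answers: 【Based on the patient's history and the Q-tip test result, cystometry would most likely show normal detrusor contractions and a normal residual volume. The main problem appears to be stress urinary incontinence caused by intrinsic sphincter deficiency, as indicated by the positive Q-tip test. This condition typically affects the urethral sphincter's ability to prevent leakage under increased pressure rather than the detrusor's ability to contract. The detrusor is therefore not overactive, and the residual volume is within the normal range.】

The reference answer is: 【In this case of stress urinary incontinence, cystometry would most likely show a normal post-void residual volume, because stress incontinence usually does not affect bladder emptying. In addition, because stress incontinence is mainly related to physical exertion rather than overactive bladder (OAB), involuntary detrusor contractions are unlikely to be observed during the test.】

Clearly, the current model's answer still misses key points that the reference answer spells out.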

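3. Fine-Tuning with unsloth

The complete code is available at codecopy.cn/post/h3lyp6. Below we walk through each part in detail.

3.1 Downloading the Dataset

The first step in fine-tuning is to build the fine-tuning dataset. For a reasoning model like DeepSeek-R1, we prefer to train on a CoT dataset (one that includes the reasoning process). Here we use the medical-o1-reasoning-SFT dataset downloaded from ModelScope.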

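Each item in the dataset has three fields: Question, Complex_CoT (the detailed reasoning process), and Response. Compare this with the data for a non-reasoning model, such as the Qwen model in my previous post "Customizing Your Own Large Model: Fine-Tuning Qwen with llama-factory". That dataset uses instruction, input, and output fields and has no chain-of-thought reasoning (instruction and input together correspond to Question).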
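As a rough sketch of how the two formats relate, the hypothetical helper below maps an instruction/input/output record onto the Question/Complex_CoT/Response layout. The function name and the sample record are made up for illustration. Since instruction-style data has no reasoning trace, the Complex_CoT field stays empty unless you supply one.

```python
def alpaca_to_cot_record(record, cot=""):
    """Map an instruction/input/output record onto Question/Complex_CoT/Response.

    The instruction and input are joined into one question. The source record
    carries no reasoning trace, so `cot` stays empty unless supplied separately.
    """
    parts = [record["instruction"].strip(), record.get("input", "").strip()]
    question = "\n".join(part for part in parts if part)
    return {"Question": question, "Complex_CoT": cot, "Response": record["output"]}


# Made-up record, for illustration only.
alpaca_record = {
    "instruction": "Translate the sentence into French.",
    "input": "Good morning",
    "output": "Bonjour",
}
converted = alpaca_to_cot_record(alpaca_record)
print(converted)
```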

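Download the dataset file medical_o1_sft.json from the ModelScope website and read it with the datasets library. The split argument loads only the first 500 records here, which keeps the demo fast: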

from datasets import load_dataset
dataset = load_dataset(path='json', data_files='./medical_o1_sft.json',split = "train[0:500]")
print(dataset[0])

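The printed record shows that each dataset entry consists of Question, Complex_CoT, and Response.

3.2 Building the Templated Q&A Data

First define the prompt template. Compared with the inference template, the training template has an extra {} placeholder for the known result (the answers in the training set are known).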

train_prompt_style = """
Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response: 
<think> 
{} 
</think>
{}
"""

Apply the template to the downloaded medical_o1_sft.json data: Question goes into the Question slot, Complex_CoT goes between the <think> tags, and Response goes into the final {}.

EOS_TOKEN = tokenizer.eos_token  # end-of-sequence token from the vocabulary; appended to every example

def formatting_prompts_func(examples):
    """Return a dict whose "text" field is the list of filled-in templates."""
    questions = examples["Question"]
    cots = examples["Complex_CoT"]
    responses = examples["Response"]
    texts = []
    for question, cot, response in zip(questions, cots, responses):
        text = train_prompt_style.format(question, cot, response) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

dataset = dataset.map(formatting_prompts_func, batched = True,)
print(dataset['text'][0])

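The printed example shows that each record has been inserted into the corresponding slots of the template.

3.3 Starting Fine-Tuning

With the dataset ready, we can start fine-tuning. First put the model into fine-tuning mode. Each parameter is explained in a comment next to it. You will notice that these parameters correspond to the LoRA settings in the llama-factory web UI from "Customizing Your Own Large Model: Fine-Tuning Qwen with llama-factory":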

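model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # rank of the LoRA matrices
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # which weight matrices get LoRA adapters
    lora_alpha=16,  # scaling weight applied to the LoRA output
    lora_dropout=0,  # dropout probability inside the LoRA adapters; 0 disables it
    bias="none",  # whether to train the bias terms
    use_gradient_checkpointing="unsloth",  # trade compute time for memory, allowing larger batches and longer sequences
    random_state=3407,  # fixed random seed for reproducibility
    use_rslora=False,  # whether to use rank-stabilized LoRA (rsLoRA), a variant that changes the scaling for stability
    loftq_config=None,  # LoftQ initialization for quantized LoRA training; None disables it
)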
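To get a feel for how small the trained part is, the arithmetic below counts the weights that LoRA adds for the rank and target modules chosen above. It is a back-of-the-envelope sketch, not output from unsloth. The layer shapes are assumptions for a Llama-3-style 8B model (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); check them against the config.json of the model you downloaded.

```python
# Shapes assumed for a Llama-3-style 8B model; verify against the model's config.json.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size of the MLP
KV_DIM = 1024          # num_key_value_heads (8) * head_dim (128)
NUM_LAYERS = 32

# (in_features, out_features) of each linear layer listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """LoRA adds A (r x in) and B (out x r) per module, i.e. r * (in + out) weights."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


r, lora_alpha = 16, 16
print(lora_param_count(r))  # 41943040
print(lora_alpha / r)       # 1.0, the factor applied to the adapter output
```

Under these assumptions, about 42 million weights are trained, a small fraction of the roughly 8 billion in the base model. Halving r halves that count.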

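Import the required libraries and create the supervised fine-tuning (SFT) trainer. FastLanguageModel.get_peft_model configures which model parameters are trained, while the trainer sets the hyperparameters that control the overall run, such as the number of training steps and epochs.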

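from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Argument names follow the TRL version used when this post was written;
# newer TRL releases move some of them into SFTConfig.
trainer = SFTTrainer(
    model=model,  # model to fine-tune
    tokenizer=tokenizer,  # tokenizer of the model
    train_dataset=dataset,  # training dataset
    dataset_text_field="text",  # column that holds the templated training text
    max_seq_length=max_seq_length,  # maximum sequence length in tokens
    dataset_num_proc=2,  # number of processes used to preprocess the dataset
    args=TrainingArguments(
        per_device_train_batch_size=2,  # batch size per GPU/device (small values suit large models)
        gradient_accumulation_steps=4,  # accumulate gradients over 4 batches, so each update uses 2*4=8 samples
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,  # number of initial steps during which the learning rate ramps up
        max_steps=60,  # total number of training steps
        learning_rate=2e-4,  # peak learning rate
        fp16=not is_bfloat16_supported(),  # fall back to fp16 when bf16 is not supported
        bf16=is_bfloat16_supported(),  # use bf16 when the GPU supports it
        logging_steps=10,  # log every 10 steps
        optim="adamw_8bit",  # 8-bit AdamW optimizer
        weight_decay=0.01,  # weight decay to reduce overfitting
        lr_scheduler_type="linear",  # linear learning-rate schedule
        seed=3407,  # random seed; runs with the same seed are reproducible
        output_dir="outputs",  # directory for training outputs
    ),
)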
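The interaction between batch size, gradient accumulation, warmup, and max_steps is easier to see with numbers. The sketch below recomputes the effective batch size and a linear warmup-then-decay schedule for the values used above. `effective_batch_size` and `linear_lr` are hypothetical helpers that mirror the general shape of a linear schedule; they are not a copy of the scheduler code in transformers.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)
samples_seen = batch * 60  # max_steps = 60
print(batch, samples_seen)  # 8 480

# Learning rate at the start, at the end of warmup, and at the last step.
for step in (0, 5, 60):
    print(step, linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60))
```

With these settings, 60 steps consume 480 samples, slightly less than one pass over the 500-record subset.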

With the parameters set, start fine-tuning with:

trainer_stats = trainer.train()

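After fine-tuning, the updated weights are already in the in-memory model object, so the fine-tuned model can be called directly. Switch it to inference mode with the call below; printing the model afterwards shows its structure: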

FastLanguageModel.for_inference(model)

You can see that the modules we targeted now contain the trained LoRA adapter weights.

3.4 Evaluating the Fine-Tuned Model

We again check by hand whether fine-tuning helped, asking the same question as before:

inputs = tokenizer([prompt_style.format(question_1, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)

print(response[0].split("### Response:")[1])

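In paraphrase, the answer is: 【Based on the gynecological exam and the Q-tip test results, cystometry would most likely show residual urine in the bladder, reduced capacity, and detrusor weakness. Stress urinary incontinence results from a weak urethral sphincter, possibly because of low bladder compliance.】

As a reminder, the reference answer is: 【In this case of stress urinary incontinence, cystometry would most likely show a normal post-void residual volume, because stress incontinence usually does not affect bladder emptying. In addition, because stress incontinence is mainly related to physical exertion rather than overactive bladder (OAB), involuntary detrusor contractions are unlikely to be observed during the test.】

The answer now addresses the key indicators, although parts of the analysis are still inaccurate. Keep in mind that we trained on only 500 records. Next, we can increase the number of training epochs and the amount of training data: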

# Load the full dataset: drop the [0:500] slice but keep split="train"
# (without split, load_dataset returns a DatasetDict rather than a Dataset)
dataset = load_dataset(path='json', data_files='./medical_o1_sft.json', split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)  # rebuild the "text" column

# Train for more epochs: leave max_steps unset and use num_train_epochs
# to set how many passes to make over the full dataset
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        warmup_steps=5,
        # max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

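The remaining steps are the same as above, so I won't repeat them. After this training run, the answers to the question improved noticeably. (Training on the full dataset for 3 epochs takes about 15 hours on an RTX 4090.)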

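3.5 Merging the Model

The outputs folder holds only the fine-tuned LoRA parameters. The next step is to merge the original parameters with the LoRA parameters. Merging is simple in unsloth; just run the following code: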

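new_model_local = "DeepSeek-R1-Medical-COT-Tiny"
model.save_pretrained(new_model_local)  # save the LoRA adapter weights
tokenizer.save_pretrained(new_model_local)  # save the tokenizer files

# Merge the base weights with the LoRA weights and save the result in 16-bit precision
model.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)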
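Conceptually, merging folds each adapter into its base weight matrix as W + (alpha / r) * B @ A, so the merged layer gives the same output as the base layer plus the adapter. The sketch below checks that identity on a tiny made-up example with plain Python lists. The helpers `matmul` and `merge_lora` are hypothetical and only illustrate the arithmetic; they say nothing about how unsloth implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Tiny made-up example: a 2x2 weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape (r, in_features) = (1, 2)
B = [[2.0], [4.0]]  # shape (out_features, r) = (2, 1)
x = [[1.0], [3.0]]  # column-vector input

merged = merge_lora(W, A, B, alpha=1, r=1)

# Output of the unmerged layer: base path plus adapter path.
adapter_out = matmul(B, matmul(A, x))
separate = [[base[0] + extra[0]] for base, extra in zip(matmul(W, x), adapter_out)]

print(merged)                         # [[2.0, -1.0], [2.0, -1.0]]
print(matmul(merged, x) == separate)  # True
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be loaded like any ordinary checkpoint.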

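4. Summary

This post showed how to fine-tune a large model with the unsloth framework while writing only a small amount of code. Compared with the approach in "Customizing Your Own Large Model: Fine-Tuning Qwen with llama-factory", unsloth requires some coding, but the workflow is clearer and easier to control. In my experience, Unsloth also has the edge over llamafactory in training performance and in the range of models it supports.

Both unsloth and llamafactory are excellent frameworks, so pick whichever you prefer. What are you waiting for? Go train a model for your own domain.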

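Readers of my fine-tuning posts often ask in the comments:

Do you have any fine-tuning experience to share?
Could you explain the fine-tuning parameters in detail?
How should a fine-tuning dataset be built?

I will cover all of these in future tutorials. If you are interested, follow the WeChat official account 大模型真好玩 for free, practice-based experience and materials on large model development.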