LoRA fine-tuning the Qwen2-72B model with LLaMA-Factory on 8 Ascend 910B NPUs

Overview

I need to LoRA fine-tune the Qwen2-72B model on an Ascend server to change its self-cognition (who it says it is and who created it).

My environment has 8 Ascend 910B1 cards, with about 512 GB of NPU memory in total.

Preparation: installing LLaMA-Factory

Install LLaMA-Factory following the official instructions: https://github.com/hiyouga/LLaMA-Factory

One thing to stress: install exactly the DeepSpeed version the documentation calls for. Too new or too old, either way multi-card training will break.
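
For reference, a source install might look like the following. This is a sketch: the extras names (torch-npu, metrics, deepspeed) come from the LLaMA-Factory README, and the exact DeepSpeed pin depends on your LLaMA-Factory release, so check the docs before installing.

shell
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
# torch-npu pulls in the Ascend NPU backend; pin deepspeed to the version
# the docs require for your release before attempting multi-card training
pip install -e ".[torch-npu,metrics,deepspeed]"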

Preparing the dataset and training configuration

My working directory is laid out as follows:

shell
./
├── data
│   ├── dataset_info.json
│   └── self_cognition.json
├── deepspeed
│   └── ds_z3_config.json
├── models
├── start_train.sh
└── train_config.yaml

In the data directory, self_cognition.json is the dataset I prepared (Alpaca format), and dataset_info.json is the dataset registry file that the training config will point at shortly.
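
For context, an Alpaca-format self-cognition record looks roughly like this (illustrative placeholder content, not my actual data):

json
[
  {
    "instruction": "Who are you?",
    "input": "",
    "output": "I am XX Assistant, a large language model trained by the XX team."
  }
]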

dataset_info.json contains:

json
{
  "train_data_name": {
    "file_name": "self_cognition.json"
  }
}

I registered only one dataset here, but this file can hold many (see the sketch below).
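
A sketch of what registering a second dataset might look like; the columns mapping follows the dataset_info format described in LLaMA-Factory's data/README, and another_dataset.json is a hypothetical file:

json
{
  "train_data_name": {
    "file_name": "self_cognition.json"
  },
  "another_dataset": {
    "file_name": "another_dataset.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}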

ds_z3_config.json in the deepspeed directory is the DeepSpeed configuration file; multi-card training requires it.

There are reference files under examples/deepspeed in the LLaMA-Factory source tree; I simply copied one over.
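
The copy is a one-liner (adjust the path to wherever you cloned LLaMA-Factory):

shell
cp /path/to/LLaMA-Factory/examples/deepspeed/ds_z3_config.json ./deepspeed/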

Its content:

json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

The models directory starts out empty; it will hold the models produced by training.

train_config.yaml is the training configuration file. Its content is as follows:

yaml
### model
model_name_or_path: /data/xxx/mindformer_share/Qwen2-72B-Instruct/

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### ddp
ddp_timeout: 180000000
deepspeed: ./deepspeed/ds_z3_config.json

### dataset
dataset: train_data_name
template: qwen
cutoff_len: 1024
max_samples: 200
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: ./models/
logging_steps: 10
save_steps: 50
plot_loss: true
overwrite_output_dir: true
#report_to: tensorboard
#logging_dir: /data/xxx/mindformer_share/llamaFactory/tensorboard/

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 40.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

Here, model_name_or_path specifies the path to the base model and output_dir the output path for the trained model. num_train_epochs is the number of training epochs, max_samples caps the number of samples used (adjust it to your situation), and save_steps controls how many steps pass between intermediate checkpoints.

The DeepSpeed configuration deserves special attention. If no DeepSpeed config file is given, training defaults to plain data parallelism, which fails as soon as the model cannot be loaded onto a single card. With DeepSpeed ZeRO-3 configured, the model is sharded across the cards, so a large model can be spread evenly over all of them.
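
A rough back-of-the-envelope estimate (my own arithmetic, assuming a 64 GB 910B1 card) shows why sharding is necessary:

python
# 72B parameters in fp16 need ~2 bytes each -- far more than one card holds.
params = 72e9
weights_gib = params * 2 / 1024**3   # ~134 GiB of fp16 weights alone
per_card_gib = weights_gib / 8       # ~17 GiB per card under ZeRO-3 sharding
print(f"{weights_gib:.0f} GiB total, {per_card_gib:.0f} GiB per card")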

My training launch script is start_train.sh:

shell
#!/bin/sh

source /usr/local/Ascend/ascend-toolkit/set_env.sh

set -x
#export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

llamafactory-cli train ./train_config.yaml

Running the training

Run the start_train.sh script from the command line:

shell
sh start_train.sh
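
Since a 72B LoRA run takes a while, I would suggest detaching it from the terminal so an SSH drop does not kill it. A common pattern (my habit, not something LLaMA-Factory requires):

shell
nohup sh start_train.sh > train.log 2>&1 &
tail -f train.log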

After training completes, the ./models directory looks like this:

shell
./models/
├── adapter_config.json
├── adapter_model.safetensors
├── added_tokens.json
├── all_results.json
├── checkpoint-100
├── checkpoint-150
├── checkpoint-200
├── checkpoint-50
├── eval_results.json
├── merges.txt
├── README.md
├── runs
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── trainer_log.jsonl
├── trainer_state.json
├── training_args.bin
├── training_loss.png
├── train_results.json
└── vocab.json

I trained for 40 epochs, saving a checkpoint every 50 steps. Looking through trainer_log.jsonl, the loss was already stable by checkpoint-150, so I decided to use checkpoint-150 as the intermediate result for the remaining steps (a quick way to scan the loss is sketched after the log below).

json
{"current_steps": 10, "total_steps": 200, "loss": 4.1278, "lr": 4e-05, "epoch": 2.0, "percentage": 5.0, "elapsed_time": "0:03:03", "remaining_time": "0:58:11"}
{"current_steps": 20, "total_steps": 200, "loss": 2.4438, "lr": 9e-05, "epoch": 4.0, "percentage": 10.0, "elapsed_time": "0:05:48", "remaining_time": "0:52:20"}
{"current_steps": 30, "total_steps": 200, "loss": 1.0016, "lr": 9.951340343707852e-05, "epoch": 6.0, "percentage": 15.0, "elapsed_time": "0:08:35", "remaining_time": "0:48:39"}
{"current_steps": 40, "total_steps": 200, "loss": 0.4434, "lr": 9.755282581475769e-05, "epoch": 8.0, "percentage": 20.0, "elapsed_time": "0:11:18", "remaining_time": "0:45:13"}
{"current_steps": 50, "total_steps": 200, "loss": 0.0837, "lr": 9.414737964294636e-05, "epoch": 10.0, "percentage": 25.0, "elapsed_time": "0:13:59", "remaining_time": "0:41:58"}
{"current_steps": 60, "total_steps": 200, "loss": 0.0096, "lr": 8.940053768033609e-05, "epoch": 12.0, "percentage": 30.0, "elapsed_time": "0:17:36", "remaining_time": "0:41:06"}
{"current_steps": 70, "total_steps": 200, "loss": 0.0059, "lr": 8.345653031794292e-05, "epoch": 14.0, "percentage": 35.0, "elapsed_time": "0:20:22", "remaining_time": "0:37:50"}
{"current_steps": 80, "total_steps": 200, "loss": 0.0019, "lr": 7.649596321166024e-05, "epoch": 16.0, "percentage": 40.0, "elapsed_time": "0:23:07", "remaining_time": "0:34:41"}
{"current_steps": 90, "total_steps": 200, "loss": 0.0026, "lr": 6.873032967079561e-05, "epoch": 18.0, "percentage": 45.0, "elapsed_time": "0:25:51", "remaining_time": "0:31:36"}
{"current_steps": 100, "total_steps": 200, "loss": 0.0011, "lr": 6.0395584540887963e-05, "epoch": 20.0, "percentage": 50.0, "elapsed_time": "0:28:36", "remaining_time": "0:28:36"}
{"current_steps": 110, "total_steps": 200, "loss": 0.0007, "lr": 5.174497483512506e-05, "epoch": 22.0, "percentage": 55.0, "elapsed_time": "0:32:03", "remaining_time": "0:26:13"}
{"current_steps": 120, "total_steps": 200, "loss": 0.001, "lr": 4.3041344951996746e-05, "epoch": 24.0, "percentage": 60.0, "elapsed_time": "0:34:44", "remaining_time": "0:23:09"}
{"current_steps": 130, "total_steps": 200, "loss": 0.0009, "lr": 3.4549150281252636e-05, "epoch": 26.0, "percentage": 65.0, "elapsed_time": "0:37:25", "remaining_time": "0:20:08"}
{"current_steps": 140, "total_steps": 200, "loss": 0.0008, "lr": 2.6526421860705473e-05, "epoch": 28.0, "percentage": 70.0, "elapsed_time": "0:40:08", "remaining_time": "0:17:12"}
{"current_steps": 150, "total_steps": 200, "loss": 0.0009, "lr": 1.9216926233717085e-05, "epoch": 30.0, "percentage": 75.0, "elapsed_time": "0:42:48", "remaining_time": "0:14:16"}
{"current_steps": 160, "total_steps": 200, "loss": 0.0009, "lr": 1.2842758726130283e-05, "epoch": 32.0, "percentage": 80.0, "elapsed_time": "0:46:37", "remaining_time": "0:11:39"}
{"current_steps": 170, "total_steps": 200, "loss": 0.0007, "lr": 7.597595192178702e-06, "epoch": 34.0, "percentage": 85.0, "elapsed_time": "0:49:21", "remaining_time": "0:08:42"}
{"current_steps": 180, "total_steps": 200, "loss": 0.0007, "lr": 3.6408072716606346e-06, "epoch": 36.0, "percentage": 90.0, "elapsed_time": "0:52:05", "remaining_time": "0:05:47"}
{"current_steps": 190, "total_steps": 200, "loss": 0.0007, "lr": 1.0926199633097157e-06, "epoch": 38.0, "percentage": 95.0, "elapsed_time": "0:54:53", "remaining_time": "0:02:53"}
{"current_steps": 200, "total_steps": 200, "loss": 0.0007, "lr": 3.04586490452119e-08, "epoch": 40.0, "percentage": 100.0, "elapsed_time": "0:57:36", "remaining_time": "0:00:00"}
{"current_steps": 200, "total_steps": 200, "epoch": 40.0, "percentage": 100.0, "elapsed_time": "0:58:26", "remaining_time": "0:00:00"}

Merging the LoRA model into the base model

Merge method 1: llamafactory-cli export

LLaMA-Factory's command-line tool has LoRA merging built in; just write a merge config file modeled on the references under examples/merge_lora/ in the source tree.

First, write a merge config file, qwen_merge_lora_config.yaml:

yaml
### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
model_name_or_path: /data/xxx/mindformer_share/Qwen2-72B-Instruct/
adapter_name_or_path: /data/xxx/mindformer_share/llamaFactory/models/checkpoint-150/
template: qwen
finetuning_type: lora

### export
export_dir: /data/xxx/mindformer_share/llamaFactory/export_merge_lora/
export_size: 2
export_device: cpu  # can also be set to npu
export_legacy_format: false

In the file above, model_name_or_path is the base model path, adapter_name_or_path is the LoRA training output path, and export_dir is where the merged model is written. template is the model's chat template and must match the one used in the LoRA training config. export_size caps the size (in GB) of each saved weight shard.

Then run in a terminal:

shell
llamafactory-cli export qwen_merge_lora_config.yaml

When it finishes, the complete merged model sits under the path defined by export_dir.

Merge method 2: Python + peft

You can also merge the LoRA model in Python by calling peft directly.

python
import torch
import torch_npu
# transfer_to_npu transparently redirects CUDA calls to the Ascend NPU backend
from torch_npu.contrib import transfer_to_npu
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel


model_path="/data/xxx/mindformer_share/Qwen2-72B-Instruct/"
lora_path="/data/xxx/mindformer_share/llamaFactory/models/checkpoint-150/"
merge_path="./export_python_merge"


print(f"Loading the Base model from {model_path}")
tokenizer = AutoTokenizer.from_pretrained(model_path,
    revision="v2.0",
    use_fast=False,
    trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(model_path,
    revision="v2.0",
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True)
    #trust_remote_code=True).eval().half().npu()

print(f"Loading the LoRA from {lora_path}")
lora_model = PeftModel.from_pretrained(
        base_model,
        lora_path,
        torch_dtype=torch.float16,
    )


print("Applying the LoRA")
model = lora_model.merge_and_unload()

print(f"Saving the target model to {merge_path}")
model.save_pretrained(merge_path)
print(f"Saving the tokenizer to {merge_path}")
tokenizer.save_pretrained(merge_path)

In the code above, model_path is the base model path, lora_path is the LoRA checkpoint directory, and merge_path is the output path for the merged model.

Comparing the two merge methods

With method 1 and export_device set to cpu, host memory usage is at least 140 GB (roughly the size of the model), CPU usage climbs to around 100 cores, and NPU resources are barely touched.

With export_device set to npu, method 1 behaves much the same as method 2.

Both methods take roughly 15 minutes to complete the merge.

Testing the merged model

The merged model is tested the same way as a model downloaded from Hugging Face. Here is the code I used to test the merged Qwen2-72B model:

python
import torch_npu  # registers the "npu" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "npu"  # the device to load the model onto

#model_path="/data/yuanll/mindformer_share/Qwen2-72B-Instruct/"
#model_path="./export_merge_lora"
model_path="./export_python_merge"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

#{"role": "system", "content": "You are a helpful assistant."},
prompt = "告诉我你的身份和创造者"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"response:{response}")

