Fine-Tuning Large Models in Practice: A Complete LlamaFactory + Ollama + SpringAI Workflow Guide

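Preface

Demand for customized large language models (LLMs) keeps growing, but the high hardware requirements of full-parameter fine-tuning and a fragmented toolchain put many developers off. LLaMA-Factory lowers that barrier considerably: it is a lightweight fine-tuning framework that reduces the workflow to three steps, "prepare data → adjust configuration → train with one click". This article walks through the whole pipeline: setting up LLaMA-Factory, downloading a model, preparing a dataset, LoRA fine-tuning, exporting and deploying to Ollama, and finally integrating through Spring AI.

Before you start, make sure you have Python ≥ 3.9, PyTorch 2.x, and either a CUDA GPU with at least 8 GB of VRAM or an Apple Silicon (M1/M2/M3) machine.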

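1. LLaMA-Factory Installation and Core Features

1.1 Core Features at a Glance

Feature	Description
Model support	Over a hundred models, including LLaMA, Qwen, ChatGLM, DeepSeek, Baichuan, and Mistral
Training methods	Full-parameter fine-tuning, LoRA, QLoRA, Freeze, DPO, PPO, pre-training
Data formats	Alpaca (instruction tuning), ShareGPT (multi-turn dialogue)
Web UI	Gradio-based, no code required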

1.2 Two Installation Options

Option 1: Docker (quick start)

bash

docker run -it --gpus all -p 7860:7860 ghcr.io/hiyouga/llama-factory:latest

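Once the container is running, open http://localhost:7860 to reach the Web UI.

Option 2: Install from source (recommended; best for deeper customization)

bash

# 1. Clone the project
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

# 2. Create a virtual environment (Python 3.10 recommended)
conda create -n llama_factory python=3.10 -y
conda activate llama_factory

# 3. Install dependencies (including those for the Web UI)
pip install -e ".[torch,metrics]"
pip install -r requirements.txt

# 4. Check that the GPU is visible to PyTorch
python -c "import torch; print(torch.cuda.is_available())"
# True means CUDA is available (on Apple Silicon, check torch.backends.mps.is_available() instead)

For reinforcement learning or quantized LoRA (QLoRA), also install packages such as trl and bitsandbytes.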
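The manual checks above can also be collected into a small script. The following is a minimal sketch, not part of LLaMA-Factory: the function names are made up for illustration, and it only tests whether packages are importable, not whether their versions are compatible.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in this guide's prerequisites


def python_version_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= tuple(minimum)


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def environment_report(optional=("trl", "bitsandbytes")):
    """Collect a small summary of the environment without importing torch."""
    return {
        "python_ok": python_version_ok(),
        "missing_required": missing_packages(["torch"]),
        "missing_optional": missing_packages(list(optional)),
    }


print(environment_report())
```

An empty `missing_required` list only means torch is installed; the GPU check from step 4 is still needed to confirm that it can see the device.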

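1.3 Common Environment Pitfalls

Common error	Fix
Incompatible Python version	Use Python 3.9 or 3.10 (this guide uses 3.10)
`llamafactory-cli` not found	Re-run `pip install -e .` from the project root
ModuleNotFoundError	Install the missing module named in the error message, and check that your CUDA and PyTorch versions are compatible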

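2. Model Download and Configuration

2.1 Download Options

LLaMA-Factory supports two download sources: Hugging Face and ModelScope. For users in mainland China, ModelScope is usually the more reliable choice.

Option 1: Download from ModelScope on the command line (recommended)

bash

# Qwen2.5-7B-Instruct is used as the example
git lfs install   # the weight files are stored with Git LFS
git clone https://www.modelscope.cn/Qwen/Qwen2.5-7B-Instruct.git

After the download finishes, move the folder into LLaMA-Factory/models/ to keep things organized.

Option 2: Download from Hugging Face (may require a proxy)

bash

git lfs install
git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

💡 Tip: to make LLaMA-Factory pull from ModelScope automatically, set this environment variable:
bash
export USE_MODELSCOPE_HUB=1  # Linux/Mac
set USE_MODELSCOPE_HUB=1     # Windows
Datasets hosted on ModelScope can also be referenced directly by repository address in dataset_info.json (the `ms_hub_url` field).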
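If you prefer to script the download instead of using git, the `modelscope` Python package offers `snapshot_download`. The sketch below assumes that package is installed (`pip install modelscope`); the helper names and the cache path are illustrative choices, not something LLaMA-Factory prescribes.

```python
from pathlib import Path


def local_model_dir(model_id, root="models"):
    """Map 'Qwen/Qwen2.5-7B-Instruct' to models/Qwen2.5-7B-Instruct,
    the folder layout suggested in this guide."""
    return Path(root) / model_id.split("/")[-1]


def download_from_modelscope(model_id, cache_dir="models/.cache"):
    """Download a model snapshot and return the local directory it landed in.

    modelscope is imported lazily so this file loads without the package.
    The cache_dir default is a hypothetical location.
    """
    from modelscope import snapshot_download

    return snapshot_download(model_id, cache_dir=cache_dir)


print(local_model_dir("Qwen/Qwen2.5-7B-Instruct"))
```

Calling `download_from_modelscope("Qwen/Qwen2.5-7B-Instruct")` triggers a download of several gigabytes, so it is left out of the demo line above.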

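3. Dataset Preparation

3.1 Data Formats Supported by LLaMA-Factory

LLaMA-Factory mainly supports two formats: Alpaca and ShareGPT.

Alpaca format (single-turn instruction tuning, the most common):

Field	Required?	Description
`instruction`	Required	The task instruction
`input`	Optional	Input data for the task
`output`	Required	The expected output
`history`	Optional	Earlier conversation turns (for multi-turn data)

json

[
  {
    "instruction": "Summarize the main point of the following text",
    "input": "Artificial intelligence is changing the world...",
    "output": "The main point is that artificial intelligence is having a transformative impact."
  },
  {
    "instruction": "Can a person survive with only one heart left?",
    "output": "Yes. People only have one heart to begin with."
  }
]

If a record has no input field, the model responds to instruction alone. For single-turn data, simply omit history.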
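Malformed records are easier to catch before training than during it. Here is a minimal validation sketch based only on the field table above; the function name and message wording are illustrative, and the check is stricter than LLaMA-Factory itself (it flags any field not listed in the table).

```python
import json

REQUIRED_FIELDS = ("instruction", "output")
OPTIONAL_FIELDS = ("input", "history")


def validate_alpaca(records):
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(records, list):
        return ["top level must be a JSON array of records"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        # Required fields must be present and non-empty strings.
        for field in REQUIRED_FIELDS:
            value = rec.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: missing or empty '{field}'")
        # Anything outside the table is reported rather than silently kept.
        unknown = set(rec) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)
        if unknown:
            problems.append(f"record {i}: unexpected fields {sorted(unknown)}")
    return problems


sample = json.loads("""
[
  {"instruction": "Summarize the text", "input": "AI is changing the world...", "output": "AI is transformative."},
  {"instruction": "Can a person survive with only one heart left?", "output": ""}
]
""")
print(validate_alpaca(sample))
```

The second sample record has an empty output on purpose, so the demo reports one problem.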

ShareGPT format (multi-turn dialogue):

json

{
  "conversations": [
    {"from": "human", "value": "Hello, who are you?"},
    {"from": "gpt", "value": "I am an AI assistant, happy to help."}
  ]
}
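The two formats carry the same information, so converting between them is mechanical. The sketch below turns one Alpaca-style record into a ShareGPT-style conversation. It assumes history is a list of [user, assistant] pairs and that instruction and input are joined with a newline; both are choices made for this example.

```python
def alpaca_to_sharegpt(record):
    """Convert one Alpaca-style record into a ShareGPT-style conversation."""
    turns = []
    # Earlier turns, if any, are assumed to be [user_text, assistant_text] pairs.
    for user_text, assistant_text in record.get("history", []):
        turns.append({"from": "human", "value": user_text})
        turns.append({"from": "gpt", "value": assistant_text})
    # Join the instruction and the optional input into one user message.
    prompt = record["instruction"]
    if record.get("input"):
        prompt = f"{prompt}\n{record['input']}"
    turns.append({"from": "human", "value": prompt})
    turns.append({"from": "gpt", "value": record["output"]})
    return {"conversations": turns}


example = alpaca_to_sharegpt({
    "instruction": "Hello, who are you?",
    "output": "I am an AI assistant, happy to help.",
})
print(example)
```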

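3.2 Registering a Custom Dataset

Place the data file: put the JSON file in the LLaMA-Factory/data/ directory.
Edit dataset_info.json:

json

// add to data/dataset_info.json
"my_dataset": {
  "file_name": "my_data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output"
  }
}

A pre-training dataset only needs columns: { "prompt": "text" }. A preference dataset must set "ranking": true and include chosen / rejected fields.

Built-in datasets (such as alpaca_zh) live in the same directory, so you can skip this step and use one of them for a quick sanity check.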
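Editing dataset_info.json by hand is easy to get wrong (a stray comma breaks the whole file), so the registration step can be scripted. This is a sketch under the assumption that the file is a single JSON object keyed by dataset name, as the snippet above suggests; `register_dataset` is a hypothetical helper, and the demo writes to a temporary directory rather than the real data/ folder.

```python
import json
import tempfile
from pathlib import Path


def register_dataset(info_path, name, file_name, columns):
    """Add or replace one entry in dataset_info.json and return all entries."""
    info_path = Path(info_path)
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    else:
        info = {}
    info[name] = {"file_name": file_name, "columns": columns}
    info_path.write_text(
        json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return info


# Demo against a throwaway directory, not the real LLaMA-Factory/data/.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset_info.json"
    result = register_dataset(
        path, "my_dataset", "my_data.json",
        {"prompt": "instruction", "query": "input", "response": "output"},
    )
    saved = json.loads(path.read_text(encoding="utf-8"))

print(json.dumps(saved, indent=2))
```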

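4. Model Fine-Tuning (LoRA)

This chapter uses Qwen2.5-7B-Instruct and the Ruozhiba (弱智吧) dataset as the running example.

4.1 Launching the Web UI

bash

llamafactory-cli webui

Alternatively, run python src/webui.py from the project directory. Then open http://localhost:7860.

4.2 Web UI Parameter Settings

Section	Parameter	Recommended value / action
Model	Model name	`Qwen2.5-7B-Instruct`
Model	Model path	`models/Qwen2.5-7B-Instruct`
Model	Fine-tuning method	`lora` (greatly reduces compute and storage cost)
Model	Chat template	`qwen`
Data	Dataset	Select `my_dataset` (custom) or one of the built-in datasets
Training	Epochs	3-10
Training	Learning rate	`1e-4` (a common choice for LoRA)
Training	Compute precision	`bf16` (higher performance) / `fp16` (more economical)
Training	Batch size	2-4 (depending on VRAM)
Training	Gradient accumulation	2
LoRA	LoRA modules	`all`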
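The same settings can be written to a config file for `llamafactory-cli train` instead of being clicked through in the Web UI. The sketch below builds such a file as text. The key names follow the example YAML files shipped in the LLaMA-Factory repository as far as I know them, so check them against your installed version; the output directory is a made-up path.

```python
def to_yaml(config):
    """Render a flat dict as 'key: value' lines (enough for this simple config)."""
    def fmt(value):
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)
    return "\n".join(f"{key}: {fmt(value)}" for key, value in config.items()) + "\n"


train_config = {
    "model_name_or_path": "models/Qwen2.5-7B-Instruct",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    "lora_target": "all",
    "template": "qwen",
    "dataset": "my_dataset",
    "output_dir": "saves/qwen2.5-7b-lora",  # hypothetical path
    "num_train_epochs": 3.0,
    "learning_rate": 1.0e-4,
    "bf16": True,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 2,
}

# Number of samples that contribute to each optimizer step on one GPU.
effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])

print(to_yaml(train_config))
print("effective batch size per GPU:", effective_batch)
```

With a batch size of 2 and gradient accumulation of 2, each optimizer step sees 4 samples per GPU, which is why gradient accumulation is the usual way to keep the effective batch size when VRAM forces a smaller batch.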

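Once everything is configured, click the "Start" button to begin training. You can watch the loss curve in real time; a loss that levels off usually indicates that the model has converged on the training data.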
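"Levels off" can be made concrete with a smoothed comparison of recent loss values. The following is a minimal sketch of one such rule; the window size and tolerance are arbitrary illustrative defaults, not values used by LLaMA-Factory.

```python
def moving_average(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def has_plateaued(losses, window=3, tolerance=0.01):
    """True if the last two smoothed loss values differ by less than
    `tolerance` in relative terms."""
    smoothed = moving_average(losses, window)
    if len(smoothed) < 2:
        return False  # not enough points to compare
    previous, latest = smoothed[-2], smoothed[-1]
    return abs(previous - latest) / max(abs(previous), 1e-12) < tolerance


print(has_plateaued([2.0, 1.5, 1.2, 1.0, 0.9]))             # still falling
print(has_plateaued([1.0, 0.5, 0.502, 0.5, 0.501, 0.5]))    # flat tail
```

A flat training loss says nothing about overfitting, so it is worth checking a held-out validation set as well before stopping.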

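5. Model Export and Ollama Deployment

LLaMA-Factory exports models in Hugging Face format, while Ollama natively works with GGUF, so a format conversion is needed.

5.1 Exporting the Model from LLaMA-Factory

In the Web UI, open the "Export" tab.
Set the parameters:
- Max shard size: 2-5 GB is recommended
- Export dir: the directory to export into
Click the "Export" button. Make sure the trained LoRA checkpoint is selected so that the adapter is merged into the base model. LLaMA-Factory automatically generates a Modelfile in the export directory.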
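Before moving on to conversion, it helps to confirm that the export actually produced what the next steps need. The sketch below checks for a config.json, at least one safetensors shard, and the Modelfile mentioned above. The exact file set can differ between versions and export settings, so treat the list as an assumption.

```python
import tempfile
from pathlib import Path


def check_export_dir(export_dir):
    """Return a list of things that look missing from an exported model folder."""
    export_dir = Path(export_dir)
    missing = []
    if not (export_dir / "config.json").is_file():
        missing.append("config.json")
    if not any(export_dir.glob("*.safetensors")):
        missing.append("*.safetensors weight shards")
    if not (export_dir / "Modelfile").is_file():
        missing.append("Modelfile")
    return missing


# Demo with a fake export folder that lacks the Modelfile.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "config.json").write_text("{}")
    (folder / "model-00001-of-00002.safetensors").write_bytes(b"")
    demo_missing = check_export_dir(folder)

print(demo_missing)
```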

5.2 Format Conversion: Hugging Face → GGUF

Tool: convert_hf_to_gguf.py from llama.cpp.
Steps:
1. Clone the llama.cpp repository from GitHub: git clone https://github.com/ggerganov/llama.cpp.git
2. Run the conversion command:

bash

python llama.cpp/convert_hf_to_gguf.py /path/to/exported_model --outtype f16 --outfile model.gguf

The resulting .gguf file can be used directly for Ollama deployment.
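When the conversion is part of a script, two small helpers are handy: one that builds the command shown above, and one that confirms the output really is a GGUF file by checking its first four bytes, which are the ASCII characters GGUF. This is a sketch; the helper names are illustrative and the script path assumes llama.cpp was cloned into the current directory.

```python
import sys
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def conversion_command(model_dir, outfile="model.gguf", outtype="f16",
                       script="llama.cpp/convert_hf_to_gguf.py"):
    """Build the argument list for subprocess.run (nothing is executed here)."""
    return [sys.executable, script, str(model_dir),
            "--outtype", outtype, "--outfile", str(outfile)]


def looks_like_gguf(path):
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(4) == GGUF_MAGIC


# Demo on fake files; a real check would point at the converted model.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "model.gguf"
    good.write_bytes(b"GGUF" + b"\x03\x00\x00\x00")
    bad = Path(tmp) / "model.bin"
    bad.write_bytes(b"PK\x03\x04")
    demo = (looks_like_gguf(good), looks_like_gguf(bad))

print(demo)
```

Passing the list from `conversion_command` to `subprocess.run(..., check=True)` runs the conversion and raises if it fails.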

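5.3 Deploying with Ollama

Step 1: Install Ollama

bash

# On Linux, use the official script (on macOS and Windows, download the installer from the official site)
curl -fsSL https://ollama.com/install.sh | sh

Step 2: Create the Modelfile (LLaMA-Factory already generated one during export; if you use it, check that its FROM line points at the GGUF file)

To write one by hand, use this template as a reference:

dockerfile

FROM ./model.gguf
# ChatML layout used by Qwen: each role marker is followed by a newline
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
"""
# Stop generating at the end-of-turn marker
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9

See the official Ollama documentation for the full template syntax.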
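To see what such a template produces, it helps to render the prompt string by hand. The sketch below builds the ChatML-style text that Qwen chat models expect, with each role marker followed by a newline. It mirrors the idea of the template rather than Ollama's actual template engine.

```python
def render_chatml(prompt, system=None):
    """Build the ChatML prompt text for one user turn."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{prompt}<|im_end|>\n")
    # The assistant marker is left open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml("Hello, who are you?", system="You are a helpful assistant."))
```

If the model rambles past its answer or echoes role markers, a mismatch between this layout and the Modelfile template is a likely cause.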

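Step 3: Create and run the model

bash

ollama create my-custom-model -f Modelfile   # create the model
ollama run my-custom-model                  # run interactively

Confirm the model is registered with ollama list.

Common Ollama commands:

Command	Description
`ollama list`	List all local models
`ollama run <model>`	Run a model interactively
`ollama ps`	List running models
`ollama rm <model>`	Delete a model
`ollama pull <model>`	Pull a model from the registry
`ollama cp <source> <destination>`	Copy a model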
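Besides the CLI, Ollama serves an HTTP API on port 11434, which is what Spring AI talks to in the next chapter. A quick way to test the deployed model without any Java is to call the /api/chat endpoint directly. The sketch below uses only the standard library; the function names are illustrative, and `chat` needs a running Ollama server with the model created.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default Ollama address


def build_chat_request(model, prompt, base_url=OLLAMA_URL):
    """Build (but do not send) a non-streaming chat request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt):
    """Send the request and return the reply text; requires a running server."""
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["message"]["content"]


request = build_chat_request("my-custom-model", "Hello")
print(request.get_method(), request.full_url)
```

Once the server is up, `chat("my-custom-model", "Hello")` returns the model's reply as a string.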

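6. Connecting the Fine-Tuned Model through Spring AI

This chapter shows how to use the Spring AI framework to connect the fine-tuned model served by Ollama to a Java backend.

6.1 Project Requirements

JDK 17 or later (Spring Boot 3.x)
Ollama service running (default port 11434)
The fine-tuned model already registered with ollama create

6.2 Maven Dependency

xml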

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
    <version>1.0.0-M6</version>
</dependency>

If you need to expose a web API, also add spring-boot-starter-web.

6.3 application.yml Configuration

yaml

spring:
  ai:
    ollama:
      base-url: http://localhost:11434  # Ollama service address
      chat:
        options:
          model: my-custom-model         # replace with your model name
          temperature: 0.7

If the connection fails, check that the Ollama service is running (on Linux with systemd: systemctl status ollama).

6.4 Controller Example

java

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/ai")
public class ChatController {

    private final ChatClient chatClient;

    public ChatController(ChatClient.Builder chatClientBuilder) {
        this.chatClient = chatClientBuilder.build();
    }

    @GetMapping("/chat")
    public String chat(@RequestParam String message) {
        return chatClient.prompt()
                .user(message)
                .call()
                .content();
    }
}

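To call it: curl -G --data-urlencode "message=你好" http://localhost:8080/ai/chat (the query string must be URL-encoded, and quoting keeps the shell from interpreting it).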
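Non-ASCII query parameters have to be percent-encoded before they go into the URL. The sketch below shows the encoding step for the /ai/chat endpoint defined by the controller; the helper name is illustrative, and the default base URL assumes the Spring Boot app runs on port 8080.

```python
from urllib.parse import urlencode


def chat_url(message, base="http://localhost:8080/ai/chat"):
    """Return the endpoint URL with the message safely percent-encoded."""
    return f"{base}?{urlencode({'message': message})}"


print(chat_url("你好"))
```

The printed URL can be passed to `urllib.request.urlopen` or pasted into a browser once the application is running.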

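7. End-to-End Case Study

7.1 Business Requirement: Fine-Tuning an Intent Recognition Model

Suppose the goal is a customer-service intent recognition model that classifies user input into categories such as "inquiry", "complaint", and "suggestion".

Data preparation (Alpaca format):

json

[
  {
    "instruction": "Classify the user message as one of [inquiry, complaint, suggestion]",
    "input": "Why hasn't my order shipped yet?",
    "output": "inquiry"
  },
  {
    "instruction": "Classify the user message as one of [inquiry, complaint, suggestion]",
    "input": "Your service is terrible",
    "output": "complaint"
  }
]

Remaining steps:

Register the data in dataset_info.json → configure LoRA fine-tuning in the Web UI (learning rate 2e-4, 3 epochs) → export the model → convert to GGUF → deploy with Ollama → connect through Spring AI.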
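For a classification task like this, the fine-tuned model should be scored on held-out examples rather than judged by the loss curve alone. The sketch below computes accuracy after normalizing the model's raw text to one of the three categories. The English label strings stand in for the categories of this case study, and the matching rule (exactly one label must appear in the output) is an assumption made for illustration.

```python
LABELS = ("inquiry", "complaint", "suggestion")


def normalize(prediction, labels=LABELS):
    """Map raw model text to a known label, or None if zero or several match."""
    text = prediction.strip().lower()
    matches = [label for label in labels if label in text]
    return matches[0] if len(matches) == 1 else None


def accuracy(predictions, gold, labels=LABELS):
    """Fraction of predictions whose normalized label equals the gold label."""
    if not gold:
        return 0.0
    correct = sum(normalize(p, labels) == g for p, g in zip(predictions, gold))
    return correct / len(gold)


model_outputs = ["inquiry", "Complaint.", "I am not sure"]
gold_labels = ["inquiry", "complaint", "suggestion"]
print(accuracy(model_outputs, gold_labels))
```

Outputs that name no category, or more than one, count as wrong, which makes formatting failures show up in the score instead of hiding them.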

7.2 Full Architecture Flowchart

[Flowchart: Preparation stage (environment setup, Python 3.9+; download base model from ModelScope/Hugging Face; prepare dataset in Alpaca/ShareGPT format) → Fine-tuning stage (LLaMA-Factory Web UI, configure LoRA parameters; run training, monitor the loss curve; export model in Hugging Face format) → Conversion and deployment (llama.cpp, convert to GGUF; Ollama import, create Modelfile; ollama run, verify the chat) → Application integration (Spring AI, configure Ollama; build REST API; calls from business systems)]

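8. Performance and Resource Reference

Model size	Training method	VRAM per GPU	Approx. fine-tuning time	Recommended GPU
7B	LoRA / QLoRA	8-16 GB	1-2 hours	RTX 3090 / 4090
14B	QLoRA + ZeRO-3	24-32 GB	2-4 hours	A100 40GB
72B	LoRA + DeepSpeed	80 GB × multiple GPUs	4+ hours	H800 × 4 or more

As a reference point, fine-tuning a 72B model took about 4 hours 11 minutes on 8 × H800, and evaluating the original model took about 16 minutes. For small and medium projects, start with 7B.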
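A rough lower bound on VRAM comes from the weights alone: parameter count times bytes per parameter. The sketch below does that arithmetic. It deliberately ignores activations, gradients, optimizer state, and the LoRA adapter itself, so real usage is higher than these figures.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion, dtype):
    """Memory for the model weights alone, in GB (10^9 bytes).

    params_billion * 1e9 parameters * bytes each / 1e9 bytes per GB
    simplifies to params_billion * bytes_per_param.
    """
    return params_billion * BYTES_PER_PARAM[dtype]


for size in (7, 14, 72):
    print(f"{size}B: fp16 {weight_memory_gb(size, 'fp16')} GB, "
          f"int4 {weight_memory_gb(size, 'int4')} GB")
```

This is why a 7B model needs 4-bit quantization (QLoRA) to fit at the low end of the range in the table, and why 72B weights in 16-bit precision, at 144 GB, cannot fit on a single 80 GB card.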

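9. Frequently Asked Questions

Problem	Fix
🧊 Out of memory (OOM)	Enable QLoRA (4-bit or 8-bit quantization), reduce the batch size, and turn on gradient accumulation
🔌 Spring AI cannot connect to Ollama	Check `base-url` and port 11434, and make sure the service is running
🔢 Loss does not decrease during training	Increase the learning rate, check data quality, train for more epochs
🔠 Poor Chinese-language results	Use a base model pre-trained on Chinese (Qwen, ChatGLM) and provide Chinese labeled data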

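10. Summary and Recommendations

LLaMA-Factory's strength is that it "leaves the complexity to the framework and the simplicity to the user". Its core value:

Hardware-friendly: LoRA / QLoRA lower the VRAM requirement;
Ready out of the box: the Web UI enables fine-tuning without code;
Ecosystem-compatible: exported models work with Ollama, vLLM, and Hugging Face TGI;
Easy to integrate: Spring AI connects the model to a Java backend quickly.

With the seven steps covered here (installation, download, data preparation, fine-tuning, export, Ollama deployment, Spring AI integration), you can run the full fine-tuning pipeline on your own. Choose a model size that fits your needs, start at 7B, and iterate from there to build a customized AI application for your business scenario.

Suggested next steps:

Try QLoRA to reduce VRAM usage further;

Use DeepSpeed ZeRO for larger models;

Replace Ollama with vLLM for high-performance production inference.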
