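Fraud Text Classification (Part 18): CPU Inference with llama.cpp

1. Introduction

In the previous articles we used LoRA to fine-tune our own model. The first question after training is how to run that model on an ordinary machine. Fine-tuning was done on dedicated GPUs with tens of gigabytes of memory; on a regular computer with only a CPU, generation may slow to one token every few seconds.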

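The llama.cpp project addresses exactly this problem. It is an open-source tool developed by Georgi Gerganov that implements large language model (LLM) inference in plain C/C++, so that models can run efficiently on ordinary CPU devices.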

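Its advantages include:

No dependency on PyTorch or Python; it runs as a compiled C++ executable.
Broad hardware support, including chips from Nvidia, AMD, Intel, Apple Silicon, and Huawei Ascend.
Support for mixed f16 and f32 precision, as well as integer quantization at 8 bits, 4 bits, and even lower bit widths (down to roughly 1.5 bits) to speed up inference.
No GPU required: it can run on the CPU alone, even on Android devices.

In this article we use llama.cpp to run the fraud text classification model fine-tuned earlier.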

2. Installation

We install llama.cpp by building it locally. Clone the source repository and enter the llama.cpp directory:


git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

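Compile by running make in the repository root. Once the build finishes, the directory contains a number of new tool executables, including llama-cli, llama-export-lora, llama-quantize, and llama-server, all of which are used below.

Install the Python dependencies: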


pip install -r requirements.txt

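3. Model File Conversion

The fine-tuned model consists of two parts, the base model and the LoRA adapter. Each is converted separately, and the two are then merged.

3.1 Converting the base model

First convert the base model with the convert_hf_to_gguf.py tool:

Note: convert_hf_to_gguf.py is a script that ships with llama.cpp, located in its installation directory. It converts models in the safetensors format downloaded from Hugging Face into GGUF files.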


!python /data2/downloads/llama.cpp/convert_hf_to_gguf.py \
    --outtype bf16 \
    --outfile /data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf \
    --model-name qwen2 \
    /data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct

Parameters:

outtype: the output precision of the weights; bf16 is bfloat16, a 16-bit floating-point format;

outfile: the output model file;

model-name: the model name;

Output:

INFO:hf-to-gguf:Loading model: Qwen2-1___5B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> BF16, shape = {1536, 151936}
INFO:hf-to-gguf:blk.0.attn_norm.weight,    torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.bfloat16 --> BF16, shape = {8960, 1536}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.bfloat16 --> BF16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.bfloat16 --> BF16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,     torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.attn_k.bias,         torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.0.attn_k.weight,       torch.bfloat16 --> BF16, shape = {1536, 256}
INFO:hf-to-gguf:blk.0.attn_output.weight,  torch.bfloat16 --> BF16, shape = {1536, 1536}
INFO:hf-to-gguf:blk.0.attn_q.bias,         torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.attn_q.weight,       torch.bfloat16 --> BF16, shape = {1536, 1536}
INFO:hf-to-gguf:blk.0.attn_v.bias,         torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.0.attn_v.weight,       torch.bfloat16 --> BF16, shape = {1536, 256}
......
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 32768
INFO:hf-to-gguf:gguf: embedding length = 1536
INFO:hf-to-gguf:gguf: feed forward length = 8960
INFO:hf-to-gguf:gguf: head count = 12
INFO:hf-to-gguf:gguf: key-value head count = 2
INFO:hf-to-gguf:gguf: rope theta = 1000000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-06
INFO:hf-to-gguf:gguf: file type = 32
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Adding 151387 merge(s).
INFO:gguf.vocab:Setting special token type eos to 151645
INFO:gguf.vocab:Setting special token type pad to 151643
INFO:gguf.vocab:Setting special token type bos to 151643
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant.<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf: n_tensors = 338, total_size = 3.1G
Writing: 100%|██████████████████████████| 3.09G/3.09G [00:39<00:00, 77.9Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf

When the command finishes, we have the GGUF file for the base model, qwen2_bf16.gguf.
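A quick way to sanity-check a converted file is to read its header. The sketch below assumes the GGUF layout used by versions 2 and later: a 4-byte magic `GGUF`, a little-endian uint32 version, then two uint64 counts (tensors and metadata key-value pairs). The function names are illustrative and not part of llama.cpp.

```python
import struct

HEADER_FORMAT = "<4sIQQ"                        # magic, version, tensor count, metadata kv count
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)    # 24 bytes


def parse_gguf_header(data):
    """Return (version, tensor_count, kv_count) from the first bytes of a GGUF file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("too short to contain a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack(HEADER_FORMAT, data[:HEADER_SIZE])
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return version, n_tensors, n_kv


def read_gguf_header(path):
    """Read only the fixed-size header from a file on disk."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))


# Usage (path taken from the conversion command above):
# read_gguf_header("/data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf")
```

If the layout assumption holds, the tensor count returned for qwen2_bf16.gguf should agree with the `n_tensors = 338` line in the conversion log.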

3.2 Converting the LoRA adapter

Next, convert the LoRA adapter with the convert_lora_to_gguf.py script.


!python /data2/downloads/llama.cpp/convert_lora_to_gguf.py \
    --base /data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct \
    --outfile /data2/anti_fraud/models/anti_fraud_v11/lora_0913_4_bf16.gguf \
    /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0913_4/checkpoint-5454 \
    --outtype bf16 --verbose

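base: the location of the base model, so that the converted LoRA adapter can be merged correctly with it;

outfile: the output file for the converted adapter;

outtype: the output format, bf16, the same as for the base model;

checkpoint-5454 is the directory that contains the LoRA adapter to convert.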

Output:

INFO:lora-to-gguf:Loading base model: Qwen2-1___5B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:lora-to-gguf:Exporting model...
INFO:hf-to-gguf:blk.0.ffn_down.weight.lora_a, torch.float32 --> BF16, shape = {8960, 16}
INFO:hf-to-gguf:blk.0.ffn_down.weight.lora_b, torch.float32 --> BF16, shape = {16, 1536}
INFO:hf-to-gguf:blk.0.ffn_gate.weight.lora_a, torch.float32 --> BF16, shape = {1536, 16}
INFO:hf-to-gguf:blk.0.ffn_gate.weight.lora_b, torch.float32 --> BF16, shape = {16, 8960}
INFO:hf-to-gguf:blk.0.ffn_up.weight.lora_a, torch.float32 --> BF16, shape = {1536, 16}
INFO:hf-to-gguf:blk.0.ffn_up.weight.lora_b, torch.float32 --> BF16, shape = {16, 8960}
......
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/data2/anti_fraud/models/anti_fraud_v11/lora_0913_4_bf16.gguf: n_tensors = 392, total_size = 36.9M
Writing: 100%|██████████████████████████| 36.9M/36.9M [00:01<00:00, 21.4Mbyte/s]
INFO:lora-to-gguf:Model successfully exported to /data2/anti_fraud/models/anti_fraud_v11/lora_0913_4_bf16.gguf

When the command finishes, we have the GGUF file for the LoRA adapter, lora_0913_4_bf16.gguf.
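The tensor count and size reported in the log can be reproduced with a little arithmetic. The sketch below takes the dimensions from the conversion logs (1536, 256, 8960, rank 16) and makes two assumptions that the logs only hint at: the model has 28 transformer blocks, and all seven projection matrices in each block carry an adapter.

```python
# Dimensions read from the conversion logs above.
HIDDEN = 1536      # embedding length
KV_DIM = 256       # width of the attn_k / attn_v projections
FFN = 8960         # feed forward length
RANK = 16          # LoRA rank visible in the lora_a / lora_b shapes
N_BLOCKS = 28      # assumption: number of transformer blocks

# (input width, output width) of each adapted projection in one block.
# Assumption: every one of these seven projections has an adapter.
PROJECTIONS = {
    "attn_q": (HIDDEN, HIDDEN),
    "attn_k": (HIDDEN, KV_DIM),
    "attn_v": (HIDDEN, KV_DIM),
    "attn_output": (HIDDEN, HIDDEN),
    "ffn_gate": (HIDDEN, FFN),
    "ffn_up": (HIDDEN, FFN),
    "ffn_down": (FFN, HIDDEN),
}


def lora_stats(projections, rank, n_blocks, bytes_per_param=2):
    """Return (tensor count, parameter count, size in bytes) of the adapter."""
    # Each projection contributes lora_a (in x rank) and lora_b (rank x out).
    params_per_block = sum(rank * (i + o) for i, o in projections.values())
    n_tensors = 2 * len(projections) * n_blocks
    n_params = params_per_block * n_blocks
    return n_tensors, n_params, n_params * bytes_per_param


tensors, params, size_bytes = lora_stats(PROJECTIONS, RANK, N_BLOCKS)
print(tensors, params, round(size_bytes / 1e6, 1))  # prints: 392 18464768 36.9
```

Under these assumptions the result lines up with `n_tensors = 392, total_size = 36.9M` in the log, which suggests the adapter covers the attention and feed-forward projections of every block.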

3.3 Merging

Use the llama-export-lora tool to merge the base model and the LoRA adapter into a single GGUF file.


!/data2/downloads/llama.cpp/llama-export-lora \
    -m /data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf \
    -o /data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf \
    --lora /data2/anti_fraud/models/anti_fraud_v11/lora_0913_4_bf16.gguf

-m: the GGUF file of the base model.

--lora: the GGUF file of the LoRA adapter.

-o: the merged output model file.

Output:

file_input: loaded gguf from /data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf
file_input: loaded gguf from /data2/anti_fraud/models/anti_fraud_v11/lora_0913_4_bf16.gguf
copy_tensor :  blk.0.attn_k.bias [256, 1, 1, 1]
merge_tensor : blk.0.attn_k.weight [1536, 256, 1, 1]
merge_tensor :   + dequantize base tensor from bf16 to F32
merge_tensor :   + merging from adapter[0] type=bf16
merge_tensor :     input_scale=1.000000 calculated_scale=2.000000 rank=16
merge_tensor :   + output type is f16
copy_tensor :  blk.0.attn_norm.weight [1536, 1, 1, 1]
merge_tensor : blk.0.attn_output.weight [1536, 1536, 1, 1]
......
copy_tensor :  output_norm.weight [1536, 1, 1, 1]
copy_tensor :  token_embd.weight [1536, 151936, 1, 1]
run_merge : merged 196 tensors with lora adapters
run_merge : wrote 338 tensors to output file
done, output file is /data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf
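The merge_tensor lines in the log correspond to the usual LoRA merge rule, W' = W + scale * (B @ A), with scale = input_scale * alpha / rank. The following is a minimal pure-Python sketch of that rule on toy matrices, not the tool's actual implementation. The log reports `calculated_scale=2.000000 rank=16`, which would correspond to lora_alpha = 32; that value is an inference from the log.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base, lora_a, lora_b, alpha, rank, input_scale=1.0):
    """Return base + scale * (lora_b @ lora_a), with scale = input_scale * alpha / rank."""
    scale = input_scale * alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


# Toy example: a 2x3 weight matrix with a rank-1 adapter.
base = [[1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]      # rank x in_features
lora_b = [[0.5], [-0.5]]        # out_features x rank
merged = merge_lora(base, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # prints: [[2.0, 2.0, 3.0], [-1.0, -1.0, -3.0]]
```

Because the low-rank update is folded into the existing weight matrix, the merged tensor has exactly the same shape as the base tensor.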

List the exported files:


-rw-rw-r-- 1   42885408 Nov  9 14:57 lora_0913_4_bf16.gguf
-rw-rw-r-- 1 3093666720 Nov  9 14:58 model_bf16.gguf
-rw-rw-r-- 1 3093666720 Nov  9 14:56 qwen2_bf16.gguf

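After these three steps, the safetensors base model and the LoRA adapter have been exported to a single GGUF model file, model_bf16.gguf. Its size is unchanged from the base model, still about 3 GB, because merging folds the adapter into the existing weights.

Use the llama-cli command to verify that the model file works.

llama-cli is a command-line interface that loads the model and runs a prompt in a single command, which makes it handy for quick testing and debugging.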


!/data2/downloads/llama.cpp/llama-cli --log-disable \
	-m /data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf \
	-p "我是一个来自太行山下小村庄家的孩子" \
	-n 100

Output:

我是一个来自太行山下小村庄家的孩子，我叫李丽丽。我是一个很平凡的女孩，我平凡得像一颗小草，平凡得像一滴水，平凡得像一粒沙。但我有一颗不平凡的心，我有我独特的个性，我有我灿烂的微笑。

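-m: path of the model file to use;

-p: the starting prompt for text generation;

-n: the maximum number of tokens to generate;

--log-disable: turns off extra log output so that only the generated text is printed.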
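For scripted smoke tests outside a notebook, the same command can be assembled and run from Python. This is a minimal sketch that uses only the four options described above; the function names are illustrative, and the binary path is whatever location llama.cpp was built in.

```python
import subprocess


def build_cli_command(binary, model_path, prompt, max_tokens=100):
    """Assemble the llama-cli argument list with the options described above."""
    return [
        binary,
        "--log-disable",           # suppress extra log output
        "-m", model_path,          # GGUF model file
        "-p", prompt,              # starting prompt
        "-n", str(max_tokens),     # maximum number of tokens to generate
    ]


def generate(binary, model_path, prompt, max_tokens=100):
    """Run llama-cli once and return the text it writes to stdout."""
    cmd = build_cli_command(binary, model_path, prompt, max_tokens)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Usage (paths taken from the command above):
# generate("/data2/downloads/llama.cpp/llama-cli",
#          "/data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf",
#          "我是一个来自太行山下小村庄家的孩子")
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes.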

4. Quantization

Use the llama-quantize tool to quantize the model file from 16 bits to 8 bits.


!/data2/downloads/llama.cpp/llama-quantize \
/data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf /data2/anti_fraud/models/anti_fraud_v11/model_bf16_q8_0.gguf q8_0

model_bf16.gguf is the model file to quantize;

model_bf16_q8_0.gguf is the quantized output file;

q8_0 selects the 8-bit Q8_0 quantization type;

Output:

main: build = 3646 (cddae488)
main: built with cc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 for x86_64-linux-gnu
main: quantizing '/data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf' to '/data2/anti_fraud/models/anti_fraud_v11/model_bf16_q8_0.gguf' as Q8_0

......

[ 337/ 338]                   output_norm.weight - [ 1536,     1,     1,     1], type =    f32, size =    0.006 MB
[ 338/ 338]                    token_embd.weight - [ 1536, 151936,     1,     1], type =   bf16, converting to q8_0 .. size =   445.12 MiB ->   236.47 MiB
llama_model_quantize_internal: model size  =  2944.68 MB
llama_model_quantize_internal: quant size  =  1564.62 MB

main: quantize time =  4671.71 ms
main:    total time =  4671.71 ms

After quantization, the model shrinks from 2944.68 MB to 1564.62 MB, a reduction of about 47%.
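To make the size reduction concrete, here is a simplified pure-Python sketch of block-wise 8-bit quantization in the style of Q8_0. It assumes the commonly described layout: weights are grouped into blocks of 32, and each block stores one 16-bit scale plus 32 signed 8-bit values, about 8.5 bits per weight. The real format stores the scale as fp16 and is implemented in C, so this illustrates the idea only.

```python
BLOCK = 32  # assumed Q8_0 block size


def quantize_q8_0(values):
    """Quantize floats block by block; returns a list of (scale, int8 values)."""
    blocks = []
    for start in range(0, len(values), BLOCK):
        chunk = values[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127 if amax else 0.0          # map the largest value to +/-127
        quants = [round(v / scale) if scale else 0 for v in chunk]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats from (scale, quants) blocks."""
    return [scale * q for scale, quants in blocks for q in quants]


def q8_0_bytes(n_weights):
    """Storage estimate: one byte per weight plus a 2-byte scale per block of 32."""
    return n_weights // BLOCK * (BLOCK + 2)


weights = [i / 10 - 1.6 for i in range(32)]          # one toy block
restored = dequantize_q8_0(quantize_q8_0(weights))
max_error = max(abs(a - b) for a, b in zip(weights, restored))

n = 1536 * 151936                                    # elements in token_embd.weight
print(round(n * 2 / 2**20, 2), round(q8_0_bytes(n) / 2**20, 2))  # prints: 445.12 236.47
```

The two printed numbers match the `445.12 MiB -> 236.47 MiB` line for token_embd.weight in the log. The ratio of 8.5 to 16 bits per weight also explains why the quantized file is slightly more than half the original size rather than exactly half.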

5. Running

Use the llama-server tool to serve the model over HTTP, a setup that can also be used for production deployment.

export CUDA_VISIBLE_DEVICES=""


llama-server \
	-m /data2/anti_fraud/models/anti_fraud_v11/model_bf16_q8_0.gguf \
	-ngl 28 -fa \
	--host 0.0.0.0 --port 8080

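--host / --port: the IP address and port to listen on; 0.0.0.0 listens on all network interfaces.

-ngl: the number of model layers to offload to the GPU, which only matters when a GPU is visible.

-fa: enables flash attention.

Note: to run on the CPU only, set the CUDA_VISIBLE_DEVICES environment variable to an empty string, as in the export command above.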
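When the server is launched from a script rather than an interactive shell, the environment variable can be set for the child process only, leaving the parent environment untouched. The sketch below is illustrative: the function names are hypothetical, and only the -m, --host, and --port options from the command above are used.

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Copy an environment and hide all CUDA devices from the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def server_command(binary, model_path, host="0.0.0.0", port=8080):
    """Argument list for llama-server with the model path, host, and port."""
    return [binary, "-m", model_path, "--host", host, "--port", str(port)]


def start_server(binary, model_path, **kwargs):
    """Start llama-server in the background and return the Popen handle."""
    return subprocess.Popen(server_command(binary, model_path, **kwargs),
                            env=cpu_only_env())


# Usage:
# proc = start_server("llama-server",
#                     "/data2/anti_fraud/models/anti_fraud_v11/model_bf16_q8_0.gguf")
# ... send requests ...
# proc.terminate()
```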

6. Access

llama.cpp ships with a web UI. Open a browser and enter the IP address of the machine running the server together with the listening port:


http://xxx.xxx.xxx.xxx:8080

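The UI comes with a default set of inference parameters, including the system prompt, the multi-turn prompt_template, and the temperature, all of which you can adjust before chatting with the model in the page.

Besides the UI, the server can also be accessed through an HTTP API.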


%%time
!curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{ \
        "messages": [{"role": "user", "content": "你是一个分析诈骗案例的专家，你的任务是分析下面对话内容是否存在经济诈骗(is_fraud:<bool>)，如果存在经济诈骗，请找出正在进行诈骗行为的发言者姓名(fraud_speaker:<str>)，并给出你的分析理由(reason:<str>)，最后以json格式输出。\n\n张伟:您好，请问是林女士吗？我是中通快递客服，我姓张。您前几天网上买了一辆自行车对吧？很抱歉，我们的快递弄丢了，按规定我们会赔偿您360元。"}],\
        "max_tokens": 512,\
        "temperature": 0  \
    }' | jq

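The HTTP API provided by llama.cpp follows the OpenAI API in its endpoint definition, request parameters, and response format, which makes it easy to integrate with applications.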

Response:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "{\"is_fraud\": true, \"fraud_speaker\": \"张伟\", \"reason\": \"对话中的张伟自称是中通快递客服，并提到用户的网购行为和物品丢失，要求赔偿。正常快递公司不会通过电话直接告知赔偿，而且金额较大，存在诱导用户提供个人信息或进行资金转账的嫌疑，符合常见的诈骗手法。\"}",
        "role": "assistant"
      }
    }
  ],
  "created": 1731148713,
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 75,
    "prompt_tokens": 118,
    "total_tokens": 193
  },
  "id": "chatcmpl-XEfFuF4pZt7h57DG0F7jhDbmG3ByAOlO"
}
CPU times: user 155 ms, sys: 10.5 ms, total: 165 ms
Wall time: 5.09 s
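The same request can be issued from application code. Below is a minimal standard-library sketch that mirrors the curl call above and then parses the JSON verdict that the model places in message.content. The function names are illustrative, and the sketch assumes the model always answers with a well-formed JSON object, which a real application should not rely on.

```python
import json
import urllib.request


def build_request(url, user_content, max_tokens=512, temperature=0):
    """Create a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": user_content}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_verdict(response_body):
    """Extract the model's JSON verdict from a chat completion response dict."""
    content = response_body["choices"][0]["message"]["content"]
    return json.loads(content)  # raises ValueError if the content is not valid JSON


def classify(url, user_content):
    """Send one dialogue to the server and return the parsed verdict dict."""
    with urllib.request.urlopen(build_request(url, user_content)) as resp:
        return parse_verdict(json.load(resp))


# Usage, with the server from section 5 running locally:
# verdict = classify("http://localhost:8080/v1/chat/completions", prompt_text)
# verdict["is_fraud"], verdict["fraud_speaker"], verdict["reason"]
```

Keeping request construction and response parsing in separate functions makes both easy to test without a running server.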

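With that, the fine-tuned model is running on the CPU.

Summary: starting from the installation of llama.cpp, this article walked step by step through exporting a LoRA fine-tuned model to the GGUF format, shrinking it through quantization, and finally running it on a CPU. A GGUF model is a single file and does not depend on a complex environment such as PyTorch, which greatly simplifies running and deploying the model.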
