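GLM-edge-1.5B-chat: a small LLM that can run inference on a CPU

GLM-edge-1.5B-chat is a textbook example of a lightweight large language model optimized for one deployment scenario: edge devices.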

Inference with transformers: https://hf-mirror.com/zai-org/glm-edge-1.5b-chat/blob/main/README_zh.md

Inference with Ollama:

https://hf-mirror.com/zai-org/glm-edge-1.5b-chat-gguf/blob/main/README_zh.md

Model downloads:

ModelScope: https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat-gguf/

Hugging Face mirror: https://d6108366.hf-mirror.com/zai-org/glm-edge-1.5b-chat-gguf/blob/main/README_zh.md

The Hugging Face mirror now often asks you to log in before downloading, so ModelScope is the less troublesome option. Ollama can also pull models directly from ModelScope.

Inference with Ollama

ollama run hf.co/zai-org/glm-edge-1.5b-chat-gguf:Q4_K_M
# or
ollama run modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf:Q4_K_M
ollama run modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf:Q5_0
ollama run modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf

The output quality, however, falls somewhat short of what I need.

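E:\github\llama.cpp>ollama run modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf:Q4_K_M
>>> Hello
! May I ask which region you are from? As an AI assistant, I have no regional attribute and can answer any question. Is there anything you would like to ask?
OK, would you like me to provide information on a particular topic?

>>> How do I set an environment variable so that models are downloaded from the ModelScope community
?
Thank you for your question! Configuring and using environment variables depends on the framework or software platform in use; in a Python environment, for example, the `os` module is usually used to set and read environment variables. But if you mean how to download or import the code of a specific model from the ModelScope community (Made-in-Community), that may require the platform's official documentation or technical support.

The exact steps may differ between models and services, and usually include the following:

1. **Visit the ModelScope community**: log in to your ModelScope community platform.
2. **Locate the target model**: find and identify the model you want to download on the platform.
3. **Check the official documentation**: refer to the official documentation or tutorials provided by the ModelScope community to learn how to download the model's code or configuration files.
4. **Follow the instructions**: if there are detailed step-by-step instructions, read them carefully and follow them.

If the steps above still do not help, please provide more context about the ModelScope community, or tell me directly what specific problem you ran into, and I will do my best to help.

It is also not as fast as I expected.

Now for a classic riddle: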

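G:\Espressif\tools\riscv32-esp-elf\esp-14.2.0_20241119\riscv32-esp-elf\bin>ollama run modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf:Q4_K_M
>>> There are 10 birds in a tree. 2 are shot dead. How many birds are left in the tree?
1. **Understand the problem**: The question says there are 10 birds in the tree, then 2 are shot dead, and asks how many birds remain.

2. **Analyze the situation**: We need to consider whether the two birds that were shot affect the number of remaining birds. If the ones shot were other, non-target birds in the tree, the total does not change; if the one shot was the specific bird in the tree that was meant to be hunted, then one must be subtracted.

3. **Determine assumptions and operations**:
   - Assume the 10 are the birds normally present, not including those that were shot dead.
   - Or, if both birds that were shot were among the birds originally in the tree, then the tree has 9 birds in total.

4. **Conclusion**: In either case, the number of remaining birds is either 10 or 9.

So, based on the description of the problem and the logical analysis, there should be 10 or 9 birds left in the tree.

That is... not right. It does not come across as very bright.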

Inference with llama.cpp

Under llama.cpp the same model is noticeably smarter.

Download the llama.cpp source code

git clone https://github.com/piDack/llama.cpp -b support_glm_edge_model

Build llama.cpp

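cmake -B build -DGGML_CUDA=ON # or enable another hardware backend
# or CPU only
cmake -B build -DGGML_CUDA=OFF
# with Visual Studio generators the default configuration is Debug;
# append --config Release to the next command for an optimized build
cmake --build build -j

Build output once it completes: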

E:\github\llama.cpp\tools\server\server-context.cpp(3343,31): warning C4267: '=': conversion from 'size_t' to 'int', po
ssible loss of data [E:\github\llama.cpp\build\tools\server\llama-server.vcxproj]
E:\github\llama.cpp\tools\server\server-context.cpp(3602,27): warning C4267: '=': conversion from 'size_t' to 'int', po
ssible loss of data [E:\github\llama.cpp\build\tools\server\llama-server.vcxproj]
  llama-export-lora.vcxproj -> E:\github\llama.cpp\build\bin\Debug\llama-export-lora.exe
  llama-lookup-stats.vcxproj -> E:\github\llama.cpp\build\bin\Debug\llama-lookup-stats.exe
  llama-perplexity.vcxproj -> E:\github\llama.cpp\build\bin\Debug\llama-perplexity.exe
  llama-mtmd-cli.vcxproj -> E:\github\llama.cpp\build\bin\Debug\llama-mtmd-cli.exe
  llama-tts.vcxproj -> E:\github\llama.cpp\build\bin\Debug\llama-tts.exe
  llama-server.vcxproj -> E:\github\llama.cpp\build\bin\Debug\llama-server.exe
  Building Custom Rule E:/github/llama.cpp/CMakeLists.txt

Inference

Download the model. The link below points to the Hugging Face mirror; the ModelScope repository listed earlier is an alternative: https://d6108366.hf-mirror.com/zai-org/glm-edge-1.5b-chat-gguf

It was saved to

E:\360Downloads\ggml-model-Q5_0.gguf

Then run inference:

llama-cli -m <path>/model.gguf -p "<|user|>\nhi<|assistant|>\n" -ngl 999

The actual command:

llama-cli -m E:\360Downloads\ggml-model-Q5_0.gguf -p "<|user|>\nhi<|assistant|>\n" -ngl 999

Output:

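hi

Hello 👋! I'm ChatGLM（智谱清言）, the artificial intelligence assistant, nice to meet you. Feel free to ask me any questions.

> Hello

Hello 👋! I am Zhipu Qingyan (ChatGLM), nice to meet you. Is there anything I can help you with?

> There are 10 birds in a tree. 2 are shot dead. How many birds are left in the tree?

This is a classic logic puzzle. Normally, if two birds are shot dead, the remaining two birds will still fly away, so no birds are left in the tree. The question may be testing logical thinking, or setting up a riddle to provoke thought. In real life, however, if there really are animals in a tree, humans should not harm them. So this question is more of a thinking game than a real situation.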

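The reasoning holds up. Not bad!

So could the difference come from the model parameters?

Why do the two runtimes give such different results?

The same model (glm-edge-1.5b-chat-gguf:Q5_0) behaves very differently under Ollama and llama.cpp. The most likely cause is that the prompt template is not being applied correctly in Ollama.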

Core issue: inconsistent prompt templates

ChatGLM-series models need a specific prompt format to work properly:

<|system|> {system prompt} <|user|> {user input} <|assistant|> {model reply}
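
The format above can also be assembled programmatically. The following is a minimal Python sketch, not the model's official chat template: the function name `build_glm_prompt` is hypothetical, and the newline placement is an assumption taken from the llama-cli prompt used earlier (`<|user|>\nhi<|assistant|>\n`).

```python
def build_glm_prompt(messages):
    """Render chat messages into the GLM tag format described above.

    Each message is a dict with "role" ("system", "user" or "assistant")
    and "content". The prompt ends with an open assistant tag so that the
    model continues with its own reply.
    """
    allowed_roles = {"system", "user", "assistant"}
    parts = []
    for message in messages:
        role = message["role"]
        if role not in allowed_roles:
            raise ValueError(f"unsupported role: {role!r}")
        # Assumed layout: tag, newline, content, with no separator between turns.
        parts.append(f"<|{role}|>\n{message['content']}")
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_glm_prompt([{"role": "user", "content": "hi"}])
print(repr(prompt))  # '<|user|>\nhi<|assistant|>\n'
```

For a single user turn this reproduces the string passed to llama-cli with `-p`, which makes it easy to check by eye that a runtime is sending the same text.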

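llama.cpp (correct usage):

llama-cli.exe -m model.gguf -p "<|user|>\nhi<|assistant|>\n"

The <|user|> and <|assistant|> tags are given explicitly, which puts the model into chat mode.

Ollama (incorrect usage):

ollama run modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf:Q5_0

Ollama appears to fall back to a generic prompt template (for example one built on [INST] tags) instead of the GLM tags. The Ollama replies shown earlier even begin with "!" and "?", as if the model were finishing the user's sentence, which fits a prompt that arrived without role tags. The likely consequences:
- the model cannot tell the dialogue roles apart
- its grasp of the context degrades
- the output drifts away from what was asked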

🔧 Fix: give Ollama a ChatGLM template

Create a custom model file (Modelfile) for Ollama:

FROM modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf:Q5_0

# ChatGLM-style template. Ollama's TEMPLATE uses Go template syntax, not Jinja.
# This is an untested sketch that follows the tag format shown above.
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}{{ end }}<|assistant|>
{{ .Response }}"""

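Run:

# 1. Create the Modelfile (paste the content above)
nano Modelfile

# 2. Build the custom model
ollama create my-glm -f Modelfile

# 3. Run the fixed model
ollama run my-glm

I am on Windows, so I created the Modelfile with Notepad instead. The command I actually ran:

ollama create smart-glm-edge-1.5b-chat -f Modelfile.txt

My `ollama create` attempts kept failing with errors, so I set this aside for now. One possible cause, not verified here, is the template syntax: Ollama's TEMPLATE directive expects Go template syntax, so a Jinja-style template would be rejected. Either way, the takeaway is that the model needs its template.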
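
As a possible workaround while the custom model cannot be created, the role tags can be added by hand and sent through Ollama's REST API with `raw` set to true, which asks Ollama to skip its own template. This is an untested sketch: it assumes an Ollama server listening on the default local port 11434, and `build_payload` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf:Q5_0"


def build_payload(user_text):
    """Wrap the user text in GLM role tags and disable Ollama's template."""
    prompt = f"<|user|>\n{user_text}<|assistant|>\n"
    # raw=True sends the prompt as-is; stream=False returns a single JSON object.
    return {"model": MODEL, "prompt": prompt, "raw": True, "stream": False}


def generate(user_text):
    """POST the raw prompt to the Ollama server and return the reply text."""
    data = json.dumps(build_payload(user_text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("hi")` should return the model's reply as a string. Whether the answers then match the llama.cpp results has not been checked here.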

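Summary

I do not fully understand why, but the same glm-edge-1.5b-chat-gguf:Q5_0 model reasons poorly when run through Ollama and sensibly when run through llama.cpp. (The generic prompt template is probably to blame.)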

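Ollama inference

E:\github\llama.cpp\build\bin\Release> ollama run modelscope.cn/ZhipuAI/glm-edge-1.5b-chat-gguf:Q5_0
pulling manifest
pulling d99992280e8a: 100% ▕██████████████████████████████████████████████████████████▏ 1.1 GB
pulling bfb925240112: 100% ▕██████████████████████████████████████████████████████████▏ 194 B
verifying sha256 digest
writing manifest
success
>>> There are 10 birds in a tree. 2 are shot dead. How many birds are left in the tree?
In reality, if two birds are shot dead, in theory new bird species may appear, which may raise some ecological and environmental concerns. But according to general theories of animal behavior, shooting one or two birds has no direct effect on the other birds, because they usually leave their dead companions. Therefore, in theory, if new ecological changes or human intervention are not considered, there should still be 10 birds in the tree.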

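llama.cpp

E:\github\llama.cpp\build\bin\Release>llama-cli.exe -m E:\360Downloads\ggml-model-Q5_0.gguf -p "<|user|>\nhi<|assistant|>\n" -ngl 999

Hello 👋! I'm ChatGLM（智谱清言）, the artificial intelligence assistant, nice to meet you. Feel free to ask me any questions.

> Hello

Hello 👋! I am Zhipu Qingyan (ChatGLM), nice to meet you. Is there anything I can help you with?

> There are 10 birds in a tree. 2 are shot dead. How many birds are left in the tree?

This is a classic logic puzzle. Normally, if two birds are shot dead, the remaining two birds will still fly away, so no birds are left in the tree. The question may be testing logical thinking, or setting up a riddle to provoke thought. In real life, however, if there really are animals in a tree, humans should not harm them. So this question is more of a thinking game than a real situation.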

Troubleshooting

Build error when compiling llama.cpp

  E:\github\llama.cpp\build\ggml\src\ggml-cuda>"E:\Program Files\cuda12\bin\nvcc.exe"  --use-local-env -ccbin "e:\Progr
  am Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu   -I"E:\github\llama
  .cpp\ggml\src\ggml-cuda\.." -IE:\github\llama.cpp\ggml\src\..\include -I"E:\Program Files\cuda\include" -I"E:\Program
   Files\cuda12\include"     --keep-dir x64\Debug -use_fast_math -maxrregcount=0   --machine 64 --compile -cudart stati
  c -std=c++17 -arch=native -extended-lambda -Xcompiler="/EHsc -Zi -Ob0" -g  -D_WINDOWS -DGGML_BACKEND_BUILD -DGGML_BAC
  KEND_SHARED -D_CRT_SECURE_NO_WARNINGS -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=1
  28 -DGGML_CUDA_USE_GRAPHS -DGGML_SHARED -D"CMAKE_INTDIR=\"Debug\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INT
  DIR=\"Debug\"" -Dggml_cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /Od /FS /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdggml-cuda
  .dir\Debug\vc143.pdb" -o ggml-cuda.dir\Debug\fattn-vec-instance-q4_0-q4_0.obj "E:\github\llama.cpp\ggml\src\ggml-cuda
  \template-instances\fattn-vec-instance-q4_0-q4_0.cu"

  E:\github\llama.cpp\build\ggml\src\ggml-cuda>"E:\Program Files\cuda12\bin\nvcc.exe"  --use-local-env -ccbin "e:\Progr
  am Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu   -I"E:\github\llama
  .cpp\ggml\src\ggml-cuda\.." -IE:\github\llama.cpp\ggml\src\..\include -I"E:\Program Files\cuda\include" -I"E:\Program
   Files\cuda12\include"     --keep-dir x64\Debug -use_fast_math -maxrregcount=0   --machine 64 --compile -cudart stati
  c -std=c++17 -arch=native -extended-lambda -Xcompiler="/EHsc -Zi -Ob0" -g  -D_WINDOWS -DGGML_BACKEND_BUILD -DGGML_BAC
  KEND_SHARED -D_CRT_SECURE_NO_WARNINGS -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=1
  28 -DGGML_CUDA_USE_GRAPHS -DGGML_SHARED -D"CMAKE_INTDIR=\"Debug\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INT
  DIR=\"Debug\"" -Dggml_cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /Od /FS /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdggml-cuda
  .dir\Debug\vc143.pdb" -o ggml-cuda.dir\Debug\fattn-vec-instance-q8_0-q8_0.obj "E:\github\llama.cpp\ggml\src\ggml-cuda
  \template-instances\fattn-vec-instance-q8_0-q8_0.cu"

  E:\github\llama.cpp\build\ggml\src\ggml-cuda>"E:\Program Files\cuda12\bin\nvcc.exe"  --use-local-env -ccbin "e:\Progr
  am Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu   -I"E:\github\llama
  .cpp\ggml\src\ggml-cuda\.." -IE:\github\llama.cpp\ggml\src\..\include -I"E:\Program Files\cuda\include" -I"E:\Program
   Files\cuda12\include"     --keep-dir x64\Debug -use_fast_math -maxrregcount=0   --machine 64 --compile -cudart stati
  c -std=c++17 -arch=native -extended-lambda -Xcompiler="/EHsc -Zi -Ob0" -g  -D_WINDOWS -DGGML_BACKEND_BUILD -DGGML_BAC
  KEND_SHARED -D_CRT_SECURE_NO_WARNINGS -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=1
  28 -DGGML_CUDA_USE_GRAPHS -DGGML_SHARED -D"CMAKE_INTDIR=\"Debug\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INT
  DIR=\"Debug\"" -Dggml_cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /Od /FS /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdggml-cuda
  .dir\Debug\vc143.pdb" -o ggml-cuda.dir\Debug\fattn-vec-instance-f16-f16.obj "E:\github\llama.cpp\ggml\src\ggml-cuda\t
  emplate-instances\fattn-vec-instance-f16-f16.cu"
  nvcc fatal   : Unsupported gpu architecture 'compute_30'
e:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.4.targets
(799,9): error MSB3721: The command ""E:\Program Files\cuda12\bin\nvcc.exe"  --use-local-env -ccbin "e:\Program Files\M
icrosoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu   -I"E:\github\llama.cpp\ggml\sr
c\ggml-cuda\.." -IE:\github\llama.cpp\ggml\src\..\include -I"E:\Program Files\cuda\include" -I"E:\Program Files\cuda12\
include"     --keep-dir x64\Debug -use_fast_math -maxrregcount=0   --machine 64 --compile -cudart static -std=c++17 -ar
ch=native -extended-lambda -Xcompiler="/EHsc -Zi -Ob0" -g  -D_WINDOWS -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -D_CRT
_SECURE_NO_WARNINGS -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_G
RAPHS -DGGML_SHARED -D"CMAKE_INTDIR=\"Debug\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INTDIR=\"Debug\"" -Dggml_
cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /Od /FS /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdggml-cuda.dir\Debug\vc143.pdb" -o
 ggml-cuda.dir\Debug\fattn-vec-instance-q4_0-q4_0.obj "E:\github\llama.cpp\ggml\src\ggml-cuda\template-instances\fattn-
vec-instance-q4_0-q4_0.cu"" exited with code 1. [E:\github\llama.cpp\build\ggml\src\ggml-cuda\ggml-cuda.vcxproj]
  nvcc fatal   : Unsupported gpu architecture 'compute_30'
  nvcc fatal   : Unsupported gpu architecture 'compute_30'
e:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.4.targets
(799,9): error MSB3721: The command ""E:\Program Files\cuda12\bin\nvcc.exe"  --use-local-env -ccbin "e:\Program Files\M
icrosoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu   -I"E:\github\llama.cpp\ggml\sr
c\ggml-cuda\.." -IE:\github\llama.cpp\ggml\src\..\include -I"E:\Program Files\cuda\include" -I"E:\Program Files\cuda12\
include"     --keep-dir x64\Debug -use_fast_math -maxrregcount=0   --machine 64 --compile -cudart static -std=c++17 -ar
ch=native -extended-lambda -Xcompiler="/EHsc -Zi -Ob0" -g  -D_WINDOWS -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -D_CRT
_SECURE_NO_WARNINGS -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_G
RAPHS -DGGML_SHARED -D"CMAKE_INTDIR=\"Debug\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INTDIR=\"Debug\"" -Dggml_
cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /Od /FS /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdggml-cuda.dir\Debug\vc143.pdb" -o
 ggml-cuda.dir\Debug\fattn-vec-instance-q8_0-q8_0.obj "E:\github\llama.cpp\ggml\src\ggml-cuda\template-instances\fattn-
vec-instance-q8_0-q8_0.cu"" exited with code 1. [E:\github\llama.cpp\build\ggml\src\ggml-cuda\ggml-cuda.vcxproj]
e:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.4.targets
(799,9): error MSB3721: The command ""E:\Program Files\cuda12\bin\nvcc.exe"  --use-local-env -ccbin "e:\Program Files\M
icrosoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu   -I"E:\github\llama.cpp\ggml\sr
c\ggml-cuda\.." -IE:\github\llama.cpp\ggml\src\..\include -I"E:\Program Files\cuda\include" -I"E:\Program Files\cuda12\
include"     --keep-dir x64\Debug -use_fast_math -maxrregcount=0   --machine 64 --compile -cudart static -std=c++17 -ar
ch=native -extended-lambda -Xcompiler="/EHsc -Zi -Ob0" -g  -D_WINDOWS -DGGML_BACKEND_BUILD -DGGML_BACKEND_SHARED -D_CRT
_SECURE_NO_WARNINGS -DGGML_SCHED_MAX_COPIES=4 -D_XOPEN_SOURCE=600 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_USE_G
RAPHS -DGGML_SHARED -D"CMAKE_INTDIR=\"Debug\"" -Dggml_cuda_EXPORTS -D_WINDLL -D_MBCS -D"CMAKE_INTDIR=\"Debug\"" -Dggml_
cuda_EXPORTS -Xcompiler "/EHsc /W1 /nologo /Od /FS /Zi /RTC1 /MDd /GR" -Xcompiler "/Fdggml-cuda.dir\Debug\vc143.pdb" -o
 ggml-cuda.dir\Debug\fattn-vec-instance-f16-f16.obj "E:\github\llama.cpp\ggml\src\ggml-cuda\template-instances\fattn-ve
c-instance-f16-f16.cu"" exited with code 1. [E:\github\llama.cpp\build\ggml\src\ggml-cuda\ggml-cuda.vcxproj]

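Cause

The GPU architecture is not supported. The build passed -arch=native to nvcc, which resolved to compute_30 for the installed GPU, and the CUDA 12.4 toolchain in this log cannot compile for that architecture. Building in CPU mode (-DGGML_CUDA=OFF) avoids the problem.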