Speed test: running GLM-4.7-Flash on an MI50

Model version: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF (GLM-4.7-Flash-UD-Q4_K_XL.gguf)

llama.cpp version: b7933
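The post does not show the build steps. A minimal sketch of building llama.cpp with the ROCm/HIP backend for the MI50 (gfx906) might look like the following; the flag names (GGML_HIP, AMDGPU_TARGETS) are assumptions based on recent llama.cpp releases and could differ for build b7933:

# Assumed build commands for the ROCm/HIP backend targeting gfx906 (MI50);
# exact CMake options may vary between llama.cpp releases.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j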

root@dev:~# llama-bench -m glm-4.7-flash.gguf 
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium |  16.31 GiB |    29.94 B | ROCm       |  99 |           pp512 |        918.25 ± 1.37 |
| deepseek2 30B.A3B Q4_K - Medium |  16.31 GiB |    29.94 B | ROCm       |  99 |           tg128 |         54.62 ± 0.12 |

Real-world usage experience:

srv  params_from_: Chat format: GLM 4.5
slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  1 | task 12639 | processing task, is_child = 0
slot update_slots: id  1 | task 12639 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 32375
slot update_slots: id  1 | task 12639 | n_tokens = 32366, memory_seq_rm [32366, end)
slot update_slots: id  1 | task 12639 | prompt processing progress, n_tokens = 32375, batch.n_tokens = 9, progress = 1.000000
slot update_slots: id  1 | task 12639 | prompt done, n_tokens = 32375, batch.n_tokens = 9
slot init_sampler: id  1 | task 12639 | init sampler, took 4.34 ms, tokens: text = 32375, total = 32375
slot print_timing: id  1 | task 12639 | 
prompt eval time =     176.24 ms /     9 tokens (   19.58 ms per token,    51.07 tokens per second)
       eval time =    1334.39 ms /    46 tokens (   29.01 ms per token,    34.47 tokens per second)
      total time =    1510.63 ms /    55 tokens

With a 32k context, prompt processing speed: 51.07 tokens per second; generation speed: 34.47 tokens per second. (In the request above most of the 32k prompt was already cached, so the prompt-eval figure covers only the 9 newly evaluated tokens.)
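The llama-server launch command is not shown in the post. Based on the log above (32768-token slots, full GPU offload), a roughly equivalent invocation might be the following; the flag values are assumptions inferred from the log, not the author's exact command:

# Assumed llama-server launch; not the author's exact command.
# Slot id 1 in the log implies at least two parallel slots,
# so total context is taken here as 2 x 32768 = 65536.
llama-server -m glm-4.7-flash.gguf -ngl 99 -c 65536 -np 2 --port 8080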
