Model version: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF (GLM-4.7-Flash-UD-Q4_K_XL.gguf)
llama.cpp version: b7933
root@dev:~# llama-bench -m glm-4.7-flash.gguf
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | ROCm | 99 | pp512 | 918.25 ± 1.37 |
| deepseek2 30B.A3B Q4_K - Medium | 16.31 GiB | 29.94 B | ROCm | 99 | tg128 | 54.62 ± 0.12 |
Real-world usage:
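The logs below come from llama-server. A minimal launch sketch that would match the logged settings (the exact command is not shown in the source; the flags here are assumptions inferred from the logged n_ctx_slot = 32768 and the ngl = 99 used in the bench above):

```shell
# Hypothetical llama-server invocation (not the author's exact command):
# -c 32768  -> matches the n_ctx_slot = 32768 seen in the slot logs
# -ngl 99   -> offload all layers to the ROCm GPU, as in the llama-bench run
llama-server -m glm-4.7-flash.gguf -c 32768 -ngl 99 --port 8080
```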
srv params_from_: Chat format: GLM 4.5
slot get_availabl: id 1 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 1.000
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 1 | task 12639 | processing task, is_child = 0
slot update_slots: id 1 | task 12639 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 32375
slot update_slots: id 1 | task 12639 | n_tokens = 32366, memory_seq_rm [32366, end)
slot update_slots: id 1 | task 12639 | prompt processing progress, n_tokens = 32375, batch.n_tokens = 9, progress = 1.000000
slot update_slots: id 1 | task 12639 | prompt done, n_tokens = 32375, batch.n_tokens = 9
slot init_sampler: id 1 | task 12639 | init sampler, took 4.34 ms, tokens: text = 32375, total = 32375
slot print_timing: id 1 | task 12639 |
prompt eval time = 176.24 ms / 9 tokens ( 19.58 ms per token, 51.07 tokens per second)
eval time = 1334.39 ms / 46 tokens ( 29.01 ms per token, 34.47 tokens per second)
total time = 1510.63 ms / 55 tokens
With the context at 32k (note: per the log above, only 9 new prompt tokens were actually processed in this request; the rest of the 32,375-token prompt hit the KV cache), prompt processing speed: 51.07 tokens per second; generation speed: 34.47 tokens per second.
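The reported speeds are simply tokens divided by wall time; a quick sanity check against the print_timing lines above:

```shell
# Recompute tokens/sec from the llama-server timing log:
# prompt eval: 9 tokens in 176.24 ms; generation: 46 tokens in 1334.39 ms
awk 'BEGIN { printf "prompt: %.2f tok/s\n", 9 / (176.24 / 1000) }'    # prints "prompt: 51.07 tok/s"
awk 'BEGIN { printf "eval:   %.2f tok/s\n", 46 / (1334.39 / 1000) }'  # prints "eval:   34.47 tok/s"
```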