First, download the model in GGUF format from the hf-mirror site, https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/tree/main. ModelScope does not yet offer this format: https://modelscope.cn/models/Tencent-Hunyuan/Hy-MT2-1.8B
Download the following file:
C:\d>curl -LO https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/resolve/main/Hy-MT2-1.8B-Q4_K_M.gguf -C -
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1365 0 1365 0 0 1408 0 0
100 1.05G 100 1.05G 0 0 7.64M 0 02:21 02:21 8.39M
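The -C - option above tells curl to resume an interrupted download from wherever the local file ends. For readers who prefer a script, here is a minimal Python sketch of the same idea using an HTTP Range header. The function names are my own, and it assumes the server supports range requests (it falls back to a full download if it does not).

```python
import os
import urllib.request


def build_resume_request(url: str, dest: str) -> urllib.request.Request:
    """Build a GET request that resumes after the bytes already saved in dest."""
    request = urllib.request.Request(url)
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    if offset > 0:
        # Ask the server for the remaining bytes only.
        request.add_header("Range", f"bytes={offset}-")
    return request


def download_with_resume(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Append the missing bytes to dest, or start over if Range is ignored.

    If dest is already complete, the server may answer 416 and urlopen
    raises urllib.error.HTTPError; this sketch does not handle that case.
    """
    request = build_resume_request(url, dest)
    with urllib.request.urlopen(request) as response:
        # 206 means the Range header was honored; 200 means a full body.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while chunk := response.read(chunk_size):
                out.write(chunk)
```

Calling download_with_resume with the resolve URL from the curl command and a local file name should behave roughly like curl -LO ... -C -, although I have only used curl for the downloads in this post.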
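Next, go to the llama.cpp GitHub repository and download the latest prebuilt llama executables. Choose the Vulkan build; the difference from the CPU build is one extra file, the 56 MB ggml-vulkan.dll, which detects the graphics card type automatically. (I started downloading the CPU build first and cancelled it with Ctrl+C.)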
C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-cpu-x64.zip
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 00:03 0
22 15.18M 22 3.35M 0 0 33555 0 07:54 01:44 06:10 30056^C
C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-vulkan-x64.zip -C -
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 0
100 31.17M 100 31.17M 0 0 35429 0 15:22 15:22 38437
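To make sense of the benchmark output, here is an excerpt explaining what the parameters mean.
Parameters
What is Q4_0
Q4_0 is a 4-bit quantization format. Its point is not that the model becomes "stronger" but that it becomes smaller, uses less VRAM, and fits on more devices. Most of these leaderboards standardize on Llama 2 7B, Q4_0, mainly to reduce variables so that results from different GPUs are easier to compare side by side.
What is pp512
pp512 can generally be read as prompt processing of 512 tokens, i.e. the throughput when processing 512 input tokens.
pp = prompt processing
512 = the input length is 512 tokens
t/s = tokens per second
It is more like the speed of "ingesting the prompt". This phase usually parallelizes more fully, so the number tends to be high.
What is tg128
tg128 can generally be read as text generation of 128 tokens, i.e. the speed when generating 128 tokens in a row.
tg = text generation
128 = 128 tokens generated consecutively
t/s = tokens per second
It is closer to what we perceive as "how fast the model answers". Because generation proceeds token by token, each step depending on the previous ones, it is usually far lower than pp512.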
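To put a number on "4-bit", here is a small sketch of the bits-per-weight arithmetic. The Q4_0 block layout in the constants (32 weights per block, 16 bytes of packed 4-bit values plus one 2-byte fp16 scale) is my understanding of the ggml format and is not stated in the excerpt above. The 1.05 GiB and 1.79 B figures are the size and params columns that llama-bench reports for the Q4_K_M file used in this post.

```python
# Assumed Q4_0 block layout (my understanding of ggml, not from the excerpt).
Q4_0_BLOCK_WEIGHTS = 32   # weights stored per block
Q4_0_PACKED_BYTES = 16    # 32 weights x 4 bits
Q4_0_SCALE_BYTES = 2      # one fp16 scale per block


def q4_0_bits_per_weight() -> float:
    """Storage cost of Q4_0 per weight, including the per-block scale."""
    block_bits = (Q4_0_PACKED_BYTES + Q4_0_SCALE_BYTES) * 8
    return block_bits / Q4_0_BLOCK_WEIGHTS


def effective_bits_per_weight(size_gib: float, params_billion: float) -> float:
    """Average bits per weight implied by a file size and a parameter count."""
    return size_gib * 2**30 * 8 / (params_billion * 1e9)


print(f"Q4_0: {q4_0_bits_per_weight():.2f} bits per weight")
print(f"Hy-MT2-1.8B Q4_K_M: {effective_bits_per_weight(1.05, 1.79):.2f} bits per weight")
```

Under these assumptions Q4_0 costs 4.5 bits per weight rather than exactly 4, and the Q4_K_M file in this post averages about 5.04 bits per weight, which would fit Q4_K_M being a mixed scheme that keeps some tensors at higher precision.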
Benchmark:
C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 0
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 0 | pp512 | 592.38 ± 13.29 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 0 | tg128 | 45.02 ± 0.42 |
build: 47c0eda9d (9279)
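As shown, it detected my integrated GPU, an AMD Radeon 780M Graphics. I then renamed ggml-vulkan.dll and ran the benchmark again. This time the backend is CPU: pp512 dropped by about 43% (592.38 to 339.36 t/s), while tg128 stayed essentially the same (45.02 vs. 45.39 t/s).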
C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 0
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | CPU | 8 | pp512 | 339.36 ± 10.26 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | CPU | 8 | tg128 | 45.39 ± 0.11 |
build: 47c0eda9d (9279)
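Consulting the documentation at https://juejin.cn/post/7382216166486540339, I learned:
-ngl N, --n-gpu-layers N:
When compiled with GPU support, this option allows offloading some layers to the GPU for computation.
It generally improves performance.
In the runs so far this parameter was 0. Now restore the file and drop the -ngl 0 argument: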
C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 99 | pp512 | 844.50 ± 9.69 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 99 | tg128 | 59.84 ± 0.31 |
build: 47c0eda9d (9279)
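This time pp512 is about 43% higher and tg128 about 33% higher than with Vulkan and -ngl 0 (844.50 vs. 592.38 t/s and 59.84 vs. 45.02 t/s).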
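The comparisons between the three runs are easy to get wrong by eye, so here is a minimal sketch that parses the llama-bench tables and computes the relative changes. The function names are my own, and the parser only assumes what is visible in the tables above: the last two columns are the test name and "mean ± stddev".

```python
def parse_bench_table(text: str) -> dict[str, float]:
    """Map each test name (e.g. 'pp512') to its mean t/s."""
    results = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows end with "<test> | <mean> ± <stddev>".
        if len(cells) >= 2 and "±" in cells[-1]:
            results[cells[-2]] = float(cells[-1].split("±")[0])
    return results


def percent_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Rows copied from the three runs in this post.
VULKAN_NGL0 = parse_bench_table("""
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 0 | pp512 | 592.38 ± 13.29 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 0 | tg128 | 45.02 ± 0.42 |
""")
CPU_ONLY = parse_bench_table("""
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | CPU | 8 | pp512 | 339.36 ± 10.26 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | CPU | 8 | tg128 | 45.39 ± 0.11 |
""")
VULKAN_NGL99 = parse_bench_table("""
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 99 | pp512 | 844.50 ± 9.69 |
| hunyuan-dense 1.8B Q4_K - Medium | 1.05 GiB | 1.79 B | Vulkan | 99 | tg128 | 59.84 ± 0.31 |
""")

for name, run in (("CPU only", CPU_ONLY), ("Vulkan, ngl 99", VULKAN_NGL99)):
    for test in ("pp512", "tg128"):
        change = percent_change(run[test], VULKAN_NGL0[test])
        print(f"{name} vs Vulkan ngl 0, {test}: {change:+.1f}%")
```

Using Vulkan with -ngl 0 as the baseline, this gives roughly -42.7% (pp512) and +0.8% (tg128) for the CPU-only run, and +42.6% (pp512) and +32.9% (tg128) for full offload.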
Running a completion example:
C:\d\llama260522>llama-completion --model ..\Hy-MT2-1.8B-Q4_K_M.gguf -p "Translate the following segment into Chinese, without additional explanation:Hello" --jinja -ngl 0 -n 64 -st
0.00.078.290 I llama_completion: llama backend init
0.00.078.296 I llama_completion: load the model and apply lora adapter, if any
0.00.078.303 I common_init_result: fitting params to device memory ...
0.00.078.304 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.408.458 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.09.586.475 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.20.135.187 I llama_completion: llama threadpool init, n_threads = 8
0.20.136.500 I llama_completion: chat template is available, enabling conversation mode (disable it with -no-cnv)
0.20.136.506 W *** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
0.20.148.699 I llama_completion: chat template example:
<|hy_begin▁of▁sentence|>You are a helpful assistant<|hy_place▁holder▁no▁3|><|hy_User|>Hello<|hy_Assistant|>Hi there<|hy_place▁holder▁no▁2|><|hy_User|>How are you?<|hy_Assistant|>
0.20.148.709 I
0.20.149.456 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.20.149.458 I
0.20.161.695 I sampler seed: 3367966364
0.20.161.906 I sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
top_k = 20, top_p = 0.800, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.20.162.115 I sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
0.20.162.118 I generate: n_ctx = 262144, n_batch = 2048, n_predict = 64, n_keep = 0
0.20.162.118 I
Translate the following segment into Chinese, without additional explanation:Hello你好 [end of text]
0.21.819.501 I common_perf_print: sampling time = 0.64 ms
0.21.819.505 I common_perf_print: samplers time = 0.09 ms / 17 tokens
0.21.819.506 I common_perf_print: load time = 19767.05 ms
0.21.819.511 I common_perf_print: prompt eval time = 1611.80 ms / 15 tokens ( 107.45 ms per token, 9.31 tokens per second)
0.21.819.513 I common_perf_print: eval time = 36.48 ms / 1 runs ( 36.48 ms per token, 27.41 tokens per second)
0.21.819.514 I common_perf_print: total time = 1684.68 ms / 16 tokens
0.21.819.515 I common_perf_print: unaccounted time = 35.77 ms / 2.1 % (total - sampling - prompt eval - eval) / (total)
0.21.819.516 I common_perf_print: graphs reused = 0
C:\d\llama260522>
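The tokens-per-second figures in the common_perf_print lines are simply a count divided by elapsed time. As a sanity check, this small sketch recomputes them from the two timing lines in the log above. The regular expression is my own and only covers the "prompt eval time" and "eval time" lines as they appear in this log.

```python
import re

# Matches e.g. "prompt eval time = 1611.80 ms / 15 tokens" and
# "eval time = 36.48 ms / 1 runs" (format taken from the log above).
PERF_LINE = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*(?:tokens|runs)"
)


def tokens_per_second(log_text: str) -> dict[str, float]:
    """Return {'prompt eval': t/s, 'eval': t/s} recomputed from timing lines."""
    rates = {}
    for line in log_text.splitlines():
        match = PERF_LINE.search(line)
        if match:
            phase = match.group(1)
            milliseconds = float(match.group(2))
            count = int(match.group(3))
            rates[phase] = count / (milliseconds / 1000.0)
    return rates


LOG = """
0.21.819.511 I common_perf_print: prompt eval time = 1611.80 ms / 15 tokens ( 107.45 ms per token, 9.31 tokens per second)
0.21.819.513 I common_perf_print: eval time = 36.48 ms / 1 runs ( 36.48 ms per token, 27.41 tokens per second)
"""

for phase, rate in tokens_per_second(LOG).items():
    print(f"{phase}: {rate:.2f} t/s")
```

The recomputed values, 9.31 t/s for prompt eval and 27.41 t/s for eval, agree with the figures printed in the log. Note that this run covers only 15 prompt tokens and a single generated token, so it is not comparable to the pp512 and tg128 benchmark numbers.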
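Testing with the CLI: for some reason it exits after translating a single sentence. Loading the text from a file behaves the same way: it translates once and exits.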
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf
Loading model... /
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st
Loading model...
build : b9279-47c0eda9d
model : Hy-MT2-1.8B-Q4_K_M.gguf
modalities : text
> 请将以下文本准确翻译为英文。
Please translate the text accurately into English.
[ Prompt: 9.9 t/s | Generation: 50.7 t/s ]
Exiting...
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st
Loading model...
build : b9279-47c0eda9d
model : Hy-MT2-1.8B-Q4_K_M.gguf
modalities : text
> /read ..\eng.txt
Loaded text from '..\eng.txt'
> 译成中文
--- 文件:..\eng.txt ---
简要总结:Lance 是一种开放性的 Lakehouse 格式,专为 AI 工作负载设计。LanceDB 与 DuckDB Labs 合作,让您能够直接在 DuckDB SQL 中执行快速向量和混合搜索,而无需中断您的分析工作流
[ Prompt: 49.9 t/s | Generation: 43.1 t/s ]
Exiting...
Addendum: after several attempts, I found that with -ngl 10 it runs on Vulkan and does not exit. Checking the help output gives:
-st, --single-turn run conversation for a single turn only, then exit when done
will not be interactive if first turn is predefined with --prompt
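(default: false)
So the exit after one turn came from -st; removing it allows multiple sentences to be translated in one session.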
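To keep the single-turn and interactive variants straight, here is a small sketch that builds the llama-cli argument list from the options used in this post (-m, --jinja, -ngl, -n, -st). The helper name and its parameters are my own invention; it only assembles arguments and does not run anything.

```python
from typing import Optional


def build_llama_cli_args(
    model_path: str,
    ngl: Optional[int] = None,
    n_predict: Optional[int] = None,
    jinja: bool = False,
    single_turn: bool = False,
) -> list[str]:
    """Assemble a llama-cli command line from the flags seen in this post."""
    args = ["llama-cli", "-m", model_path]
    if jinja:
        args.append("--jinja")
    if ngl is not None:
        args += ["-ngl", str(ngl)]
    if n_predict is not None:
        args += ["-n", str(n_predict)]
    if single_turn:
        # -st exits after one turn, which caused the early exits above.
        args.append("-st")
    return args


MODEL = r"..\Hy-MT2-1.8B-Q4_K_M.gguf"
print(build_llama_cli_args(MODEL, ngl=0, n_predict=64, jinja=True, single_turn=True))
print(build_llama_cli_args(MODEL, ngl=10))
```

The first call reproduces the command that exited after one translation, and the second reproduces the interactive command used below. The resulting list could be passed to subprocess.run from the llama.cpp directory, though I ran the commands by hand.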
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 10
Loading model...
...
**未来趋势**
未来一年需要关注的是,针对PG19/PG20的列存储工作和TAM扩展生态是否会融合或分化。
社区方向普遍倾向于更高的可插拔性------更细粒度的TAM钩子,不需要每次都解析pgsql-hackers的修补的规划器集成点,以及真正能够反映自定义存储成本的估算机制。供应商方向(Snowflake、Databricks、Microsoft等,它们的Postgres产品底层都有定制存储层)则远离这一方向,因为他们的差异化优势存在于TAM线以下,可插拔性会稀释其护城河。
未来两年的架构思维走向将决定2028年"Postgres"的含义。我有个人偏好,你们可以猜测是什么。
目前的实际答案是:运行基准测试。这两个扩展都有足够稳定的版本可供测试。TAM时代已不再只是假设。
https://thebuild.com/blog/2026/05/20/table-access-methods-wake-up/
[ Prompt: 365.9 t/s | Generation: 32.1 t/s ]