Testing Tencent's Hunyuan Hy-MT2 multilingual translation model with the Vulkan build of llama.cpp

First, download the model in GGUF format from the hf-mirror site: https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/tree/main. ModelScope does not offer this format yet: https://modelscope.cn/models/Tencent-Hunyuan/Hy-MT2-1.8B

Download the following file:

C:\d>curl -LO https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/resolve/main/Hy-MT2-1.8B-Q4_K_M.gguf -C -

  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
100   1365   0   1365   0      0   1408      0                              0
100  1.05G 100  1.05G   0      0  7.64M      0   02:21   02:21          8.39M

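Next, go to the llama.cpp GitHub repository and download the latest prebuilt llama binaries. Choose the Vulkan build. The only difference from the CPU build is one extra 56 MB file, ggml-vulkan.dll, which detects the GPU type automatically.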

C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-cpu-x64.zip
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
  0      0   0      0   0      0      0      0           00:03              0
 22 15.18M  22  3.35M   0      0  33555      0   07:54   01:44   06:10  30056^C
C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-vulkan-x64.zip -C -
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
  0      0   0      0   0      0      0      0                              0
100 31.17M 100 31.17M   0      0  35429      0   15:22   15:22          38437

To make sense of the benchmark output, here is an excerpt explaining what the parameters mean.

Parameters

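What is Q4_0

Q4_0 is a 4-bit quantization format. Its point is not a "stronger model" but a "smaller model that uses less VRAM and fits on more devices". Most of these leaderboards standardize on Llama 2 7B, Q4_0, mainly to reduce variables so that results from different GPUs are easier to compare side by side. (The model tested in this post uses Q4_K_M, a different 4-bit format.)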
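As a rough check on what a 4-bit format means for file size, the sketch below computes the average bits per weight from a file size and a parameter count. The inputs are the values llama-bench reports in this post for the Q4_K_M file (1.05 GiB, 1.79 B parameters). The function name is made up for illustration, and both inputs are rounded, so the result is only approximate.

```python
GIB = 1024 ** 3  # bytes in one GiB


def bits_per_weight(size_gib: float, params_billion: float) -> float:
    """Average number of bits stored per parameter (hypothetical helper)."""
    total_bits = size_gib * GIB * 8
    return total_bits / (params_billion * 1e9)


# Rounded values as reported by llama-bench for Hy-MT2-1.8B-Q4_K_M.gguf.
bpw = bits_per_weight(1.05, 1.79)
print(f"{bpw:.2f} bits per weight")
```

The result lands at roughly 5 bits per weight rather than exactly 4. That is plausible for a quantized file, since it also has to store per-block scaling data alongside the 4-bit values.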

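What is pp512

pp512 stands for prompt processing of 512 tokens, that is, the throughput when processing 512 input tokens.

pp = prompt processing

512 = the input is 512 tokens long

t/s = tokens per second

Think of it as the "speed of ingesting the prompt". This stage parallelizes well, so the number is usually high.

What is tg128

tg128 stands for text generation of 128 tokens, that is, the speed when generating 128 tokens in a row.

tg = text generation

128 = 128 tokens generated consecutively

t/s = tokens per second

It is closer to what we perceive as "how fast the model answers". Because generation proceeds one token at a time, each depending on the previous ones, it is usually well below pp512.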
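To relate the two figures to wall-clock time, here is a minimal sketch that estimates how long a request takes given a pp rate and a tg rate. It assumes throughput stays constant regardless of prompt length and ignores model load time, neither of which real runs guarantee. The function name and the sample rates are illustrative only.

```python
def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     pp_rate: float, tg_rate: float) -> float:
    """Estimated time = prompt processing time + generation time."""
    return prompt_tokens / pp_rate + output_tokens / tg_rate


# Illustrative rates: 500 t/s prompt processing, 50 t/s generation.
t = estimate_seconds(prompt_tokens=512, output_tokens=128,
                     pp_rate=500.0, tg_rate=50.0)
print(f"{t:.2f} s")  # 512/500 + 128/50 seconds
```

With these sample rates the prompt is four times longer than the output, yet generation still accounts for most of the total. That is why tg128 tracks perceived responsiveness more closely than pp512.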

Benchmark:

C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf  -ngl 0
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |   0 |           pp512 |       592.38 ± 13.29 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |   0 |           tg128 |         45.02 ± 0.42 |

build: 47c0eda9d (9279)

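As shown, it detected my integrated GPU, an AMD Radeon 780M Graphics. I then renamed ggml-vulkan.dll and ran the benchmark again. This time the backend is the CPU: pp512 dropped by about 43% (from 592.38 to 339.36 t/s), while tg128 stayed essentially the same.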

C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf  -ngl 0
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           pp512 |       339.36 ± 10.26 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           tg128 |         45.39 ± 0.11 |

build: 47c0eda9d (9279)
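For comparing several runs, it can help to pull the numbers out of the table instead of reading them by eye. The sketch below parses the Markdown-style table that llama-bench printed above into dictionaries. It handles only this table layout, both function names are my own, and if your build of llama-bench supports a structured output format, that is likely a more robust route.

```python
def parse_bench_table(text: str) -> list[dict[str, str]]:
    """Parse a Markdown pipe table into one dict per data row."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip log lines such as "load_backend: ..."
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells            # first table line is the header
        elif set(cells[0]) <= set("-:"):
            continue                  # separator line like "| ---- | ---: |"
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def mean_tps(row: dict[str, str]) -> float:
    """Extract the mean from a 't/s' cell such as '339.36 ± 10.26'."""
    return float(row["t/s"].split("±")[0])


sample = """\
load_backend: loaded CPU backend from C:\\d\\llama260522\\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           pp512 |       339.36 ± 10.26 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           tg128 |         45.39 ± 0.11 |
"""

for row in parse_bench_table(sample):
    print(row["backend"], row["test"], mean_tps(row))
```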

Consulting the documentation at https://juejin.cn/post/7382216166486540339, I learned:

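-ngl N, --n-gpu-layers N:

When compiled with GPU support, this option allows some layers to be offloaded to the GPU for computation.

This generally improves performance.

In the runs so far this parameter was 0. Next, restore the DLL's file name and drop the -ngl 0 argument: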

C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |  99 |           pp512 |        844.50 ± 9.69 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |  99 |           tg128 |         59.84 ± 0.31 |

build: 47c0eda9d (9279)

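This time, compared with Vulkan at -ngl 0, pp512 improved by about 43% (592.38 to 844.50 t/s) and tg128 by about 33% (45.02 to 59.84 t/s).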
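These comparisons come down to simple ratios. The sketch below recomputes them from the mean t/s values in the three llama-bench tables in this post. The dictionary layout and function name are my own, and the ± spread reported by llama-bench is ignored.

```python
# Mean t/s copied from the three llama-bench runs in this post.
runs = {
    "cpu":          {"pp512": 339.36, "tg128": 45.39},
    "vulkan_ngl0":  {"pp512": 592.38, "tg128": 45.02},
    "vulkan_ngl99": {"pp512": 844.50, "tg128": 59.84},
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Use the Vulkan run with -ngl 0 as the baseline for both comparisons.
for test in ("pp512", "tg128"):
    base = runs["vulkan_ngl0"][test]
    print(test,
          f"cpu vs ngl0: {pct_change(base, runs['cpu'][test]):+.1f}%",
          f"ngl99 vs ngl0: {pct_change(base, runs['vulkan_ngl99'][test]):+.1f}%")
```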

Running a completion example

C:\d\llama260522>llama-completion --model ..\Hy-MT2-1.8B-Q4_K_M.gguf  -p "Translate the following segment into Chinese, without additional explanation：Hello" --jinja -ngl 0 -n 64 -st
0.00.078.290 I llama_completion: llama backend init
0.00.078.296 I llama_completion: load the model and apply lora adapter, if any
0.00.078.303 I common_init_result: fitting params to device memory ...
0.00.078.304 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.408.458 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.09.586.475 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.20.135.187 I llama_completion: llama threadpool init, n_threads = 8
0.20.136.500 I llama_completion: chat template is available, enabling conversation mode (disable it with -no-cnv)
0.20.136.506 W *** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
0.20.148.699 I llama_completion: chat template example:
<｜hy_begin▁of▁sentence｜>You are a helpful assistant<｜hy_place▁holder▁no▁3｜><｜hy_User｜>Hello<｜hy_Assistant｜>Hi there<｜hy_place▁holder▁no▁2｜><｜hy_User｜>How are you?<｜hy_Assistant｜>
0.20.148.709 I
0.20.149.456 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.20.149.458 I
0.20.161.695 I sampler seed: 3367966364
0.20.161.906 I sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 20, top_p = 0.800, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.20.162.115 I sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
0.20.162.118 I generate: n_ctx = 262144, n_batch = 2048, n_predict = 64, n_keep = 0
0.20.162.118 I
Translate the following segment into Chinese, without additional explanation：Hello你好 [end of text]


0.21.819.501 I common_perf_print:    sampling time =       0.64 ms
0.21.819.505 I common_perf_print:    samplers time =       0.09 ms /    17 tokens
0.21.819.506 I common_perf_print:        load time =   19767.05 ms
0.21.819.511 I common_perf_print: prompt eval time =    1611.80 ms /    15 tokens (  107.45 ms per token,     9.31 tokens per second)
0.21.819.513 I common_perf_print:        eval time =      36.48 ms /     1 runs   (   36.48 ms per token,    27.41 tokens per second)
0.21.819.514 I common_perf_print:       total time =    1684.68 ms /    16 tokens
0.21.819.515 I common_perf_print: unaccounted time =      35.77 ms /   2.1 %      (total - sampling - prompt eval - eval) / (total)
0.21.819.516 I common_perf_print:    graphs reused =          0

C:\d\llama260522>
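The per-token figures in the common_perf_print lines follow directly from the elapsed time and the token count. A minimal sketch of that arithmetic, using the prompt eval and eval numbers from the log above (the function name is hypothetical):

```python
def per_token_stats(total_ms: float, n_tokens: int) -> tuple[float, float]:
    """Return (milliseconds per token, tokens per second)."""
    ms_per_token = total_ms / n_tokens
    tokens_per_second = 1000.0 / ms_per_token
    return ms_per_token, tokens_per_second


# Numbers taken from the log: 1611.80 ms for 15 prompt tokens,
# 36.48 ms for 1 generated token.
for label, ms, n in (("prompt eval", 1611.80, 15), ("eval", 36.48, 1)):
    ms_tok, tps = per_token_stats(ms, n)
    print(f"{label}: {ms_tok:.2f} ms per token, {tps:.2f} tokens per second")
```

Note that with only 15 prompt tokens and a single generated token, these rates are based on a very small sample and are not directly comparable to the pp512 and tg128 benchmark figures.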

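Testing with the CLI: for some reason it exits after translating a single sentence. Reading the text in from a file behaves the same way, exiting after one translation.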

C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf

Loading model... /
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st

Loading model...


build      : b9279-47c0eda9d
model      : Hy-MT2-1.8B-Q4_K_M.gguf
modalities : text


> 请将以下文本准确翻译为英文。

Please translate the text accurately into English.

[ Prompt: 9.9 t/s | Generation: 50.7 t/s ]

Exiting...

C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st

Loading model...


build      : b9279-47c0eda9d
model      : Hy-MT2-1.8B-Q4_K_M.gguf
modalities : text


> /read ..\eng.txt

Loaded text from '..\eng.txt'

> 译成中文

--- 文件：..\eng.txt ---
简要总结：Lance 是一种开放性的 Lakehouse 格式，专为 AI 工作负载设计。LanceDB 与 DuckDB Labs 合作，让您能够直接在 DuckDB SQL 中执行快速向量和混合搜索，而无需中断您的分析工作流

[ Prompt: 49.9 t/s | Generation: 43.1 t/s ]

Exiting...

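Postscript: after several attempts, I found that with -ngl 10 the CLI runs on Vulkan without exiting. Checking the help output turned up:

-st, --single-turn run conversation for a single turn only, then exit when done

will not be interactive if first turn is predefined with --prompt

(default: false)

Dropping -st makes it possible to translate multiple sentences.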

C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 10

Loading model...
...
**未来趋势**

未来一年需要关注的是，针对PG19/PG20的列存储工作和TAM扩展生态是否会融合或分化。

社区方向普遍倾向于更高的可插拔性------更细粒度的TAM钩子，不需要每次都解析pgsql-hackers的修补的规划器集成点，以及真正能够反映自定义存储成本的估算机制。供应商方向（Snowflake、Databricks、Microsoft等，它们的Postgres产品底层都有定制存储层）则远离这一方向，因为他们的差异化优势存在于TAM线以下，可插拔性会稀释其护城河。

未来两年的架构思维走向将决定2028年"Postgres"的含义。我有个人偏好，你们可以猜测是什么。

目前的实际答案是：运行基准测试。这两个扩展都有足够稳定的版本可供测试。TAM时代已不再只是假设。

https://thebuild.com/blog/2026/05/20/table-access-methods-wake-up/

[ Prompt: 365.9 t/s | Generation: 32.1 t/s ]