Testing Tencent's Hunyuan Hy-MT2 multilingual translation model with the Vulkan build of llama.cpp

First, download the model in GGUF format from the hf-mirror site (https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/tree/main). ModelScope does not offer this format yet (https://modelscope.cn/models/Tencent-Hunyuan/Hy-MT2-1.8B).

Download the following file:

C:\d>curl -LO https://hf-mirror.com/tencent/Hy-MT2-1.8B-GGUF/resolve/main/Hy-MT2-1.8B-Q4_K_M.gguf -C -

  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
100   1365   0   1365   0      0   1408      0                              0
100  1.05G 100  1.05G   0      0  7.64M      0   02:21   02:21          8.39M
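Because the download was run with `-C -` (resume), it can be worth confirming that the file on disk really is a GGUF model and not a truncated file or a saved redirect page. A minimal sketch, assuming only that a GGUF file begins with the four bytes `GGUF` followed by a little-endian 32-bit version number; the function names are my own:

```python
import struct

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file

def parse_gguf_version(header):
    """Return the GGUF version from the first 8 bytes, or raise ValueError."""
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF header")
    # The version follows the magic as a little-endian unsigned 32-bit integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version

def read_gguf_version(path):
    """Read only the first 8 bytes of the file and parse them."""
    with open(path, "rb") as f:
        return parse_gguf_version(f.read(8))
```

Calling `read_gguf_version(r"C:\d\Hy-MT2-1.8B-Q4_K_M.gguf")` should return a small integer for a valid file and raise `ValueError` otherwise. This only checks the header, not the integrity of the whole file.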

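Next, go to the llama.cpp GitHub repository and download the latest prebuilt llama binaries, choosing the Vulkan build. The only difference from the CPU build is an additional 56 MB ggml-vulkan.dll, which detects the GPU type automatically.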

C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-cpu-x64.zip
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
  0      0   0      0   0      0      0      0           00:03              0
 22 15.18M  22  3.35M   0      0  33555      0   07:54   01:44   06:10  30056^C
C:\d>curl -LO https://github.com/ggml-org/llama.cpp/releases/download/b9279/llama-b9279-bin-win-vulkan-x64.zip -C -
  % Total    % Received % Xferd  Average Speed  Time    Time    Time   Current
                                 Dload  Upload  Total   Spent   Left   Speed
  0      0   0      0   0      0      0      0                              0
100 31.17M 100 31.17M   0      0  35429      0   15:22   15:22          38437

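To make the benchmark output easier to read, here is an excerpt explaining what the parameters mean.

Parameters

What is Q4_0

Q4_0 is a 4-bit quantization format. Its point is not "a stronger model" but "a smaller model that needs less VRAM and fits on more devices". Most of these leaderboards standardize on Llama 2 7B, Q4_0, mainly to reduce the number of variables so that results from different GPUs are easier to compare side by side.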
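To make "4-bit" a little more concrete, here is a simplified sketch of block quantization in the spirit of Q4_0, based on my understanding of the ggml layout: weights are grouped into blocks of 32, and each block stores one 16-bit scale plus one 4-bit integer per weight. This is not the library's code. The real format packs two 4-bit values into each byte and stores the scale as fp16, and the function names below are made up for illustration.

```python
BLOCK_SIZE = 32  # weights per block (assumed Q4_0 block size)

def quantize_block(weights):
    """Map 32 floats to one scale plus 32 integers in the range 0..15."""
    assert len(weights) == BLOCK_SIZE
    # Use the weight with the largest magnitude (sign included) to set the scale.
    extreme = max(weights, key=abs)
    scale = extreme / -8.0
    if scale == 0.0:
        return 0.0, [8] * BLOCK_SIZE  # all-zero block: 8 decodes to 0
    # Shift by 8 so the stored value is unsigned, then clamp to 4 bits.
    quants = [min(15, max(0, int(w / scale + 8.5))) for w in weights]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate weights from the scale and the 4-bit integers."""
    return [(q - 8) * scale for q in quants]

def bits_per_weight():
    """16 bits of scale plus 32 x 4 bits of values, shared by 32 weights."""
    return (16 + BLOCK_SIZE * 4) / BLOCK_SIZE
```

Under these assumptions each weight costs 4.5 bits rather than exactly 4, because the per-block scale is shared by only 32 weights.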

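What is pp512

pp512 can generally be read as "prompt processing, 512 tokens": the throughput when processing 512 input tokens.

pp = prompt processing

512 = the input length is 512 tokens

t/s = tokens per second

It is roughly "how fast the prompt is ingested". This stage parallelizes well, so the numbers tend to be high.

What is tg128

tg128 can generally be read as "text generation, 128 tokens": the speed when generating 128 tokens in a row.

tg = text generation

128 = 128 tokens generated consecutively

t/s = tokens per second

This is closer to what we perceive as "how fast the model answers". Because generation proceeds one token at a time, each step depending on the previous one, it is usually much lower than pp512.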
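The two figures can be combined into a rough response-time estimate. The sketch below is a back-of-the-envelope model only: it assumes throughput stays constant at other prompt and output lengths and ignores model loading and warm-up. The function name and the example numbers are hypothetical.

```python
def estimate_latency_seconds(prompt_tokens, generated_tokens, pp_tps, tg_tps):
    """Estimate response time from prompt-processing and generation throughput."""
    prompt_time = prompt_tokens / pp_tps          # time to ingest the prompt
    generation_time = generated_tokens / tg_tps   # time to emit the answer token by token
    return prompt_time + generation_time

# Hypothetical device: 500 t/s prompt processing, 50 t/s generation.
seconds = estimate_latency_seconds(512, 128, pp_tps=500.0, tg_tps=50.0)
print(f"{seconds:.3f} s")  # 1.024 s for the prompt + 2.560 s for generation
```

Even though the prompt is four times longer than the output in this example, generation accounts for most of the time, which is why tg128 is the better indicator of perceived speed.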

Benchmark:

C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf  -ngl 0
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |   0 |           pp512 |       592.38 ± 13.29 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |   0 |           tg128 |         45.02 ± 0.42 |

build: 47c0eda9d (9279)
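The table rows are easy to pick apart in a script. The sketch below parses one row copied from the output above and derives an approximate "bits per weight" from the size and params columns. It assumes `GiB` means 2^30 bytes and `B` means 10^9 parameters. Both columns are rounded in the table, so the result is only approximate. The function and key names are my own.

```python
ROW = ("| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B "
       "| Vulkan     |   0 |           pp512 |       592.38 ± 13.29 |")

def parse_bench_row(row):
    """Split one llama-bench markdown row into a dictionary."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    model, size, params, backend, setting, test, tps = cells
    mean, stddev = (float(x) for x in tps.split("±"))
    return {
        "model": model,
        "size_gib": float(size.split()[0]),
        "params_billion": float(params.split()[0]),
        "backend": backend,
        "setting": setting,   # the ngl or threads column, depending on the run
        "test": test,
        "tps_mean": mean,
        "tps_stddev": stddev,
    }

def effective_bits_per_weight(record):
    """File size in bits divided by the parameter count."""
    size_bits = record["size_gib"] * 2**30 * 8
    return size_bits / (record["params_billion"] * 1e9)

record = parse_bench_row(ROW)
print(record["test"], record["tps_mean"], round(effective_bits_per_weight(record), 2))
```

For this file the estimate comes out at roughly 5 bits per weight, a bit above a nominal 4 bits.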

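As the output shows, it detected my integrated GPU, an AMD Radeon 780M Graphics. I then renamed ggml-vulkan.dll and ran the benchmark again. This time the backend is CPU: pp512 drops by about 43% (592.38 to 339.36 t/s), while tg128 is essentially unchanged (45.02 vs. 45.39 t/s).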

C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf  -ngl 0
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           pp512 |       339.36 ± 10.26 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | CPU        |       8 |           tg128 |         45.39 ± 0.11 |

build: 47c0eda9d (9279)

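From the documentation at https://juejin.cn/post/7382216166486540339 I learned the following:

-ngl N, --n-gpu-layers N:

When llama.cpp is compiled with GPU support, this option offloads some of the layers to the GPU for computation.

This generally improves performance.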
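To compare several offload settings in one go, the benchmark command can be generated from a script. This sketch assumes that llama-bench accepts comma-separated values for `-ngl` and an `-o json` output option, as I understand its help text. Please confirm with `llama-bench --help` for the build in use. The helper name is hypothetical, and the function only builds the argument list without running anything.

```python
def bench_command(model_path, ngl_values, output_format="json"):
    """Build an argument list for one llama-bench run covering several -ngl values."""
    # Assumption: llama-bench takes multiple values for a parameter separated by commas.
    ngl_arg = ",".join(str(n) for n in ngl_values)
    return ["llama-bench", "-m", model_path, "-ngl", ngl_arg, "-o", output_format]

cmd = bench_command(r"..\Hy-MT2-1.8B-Q4_K_M.gguf", [0, 10, 99])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` from the llama directory.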

So far this parameter has been 0. Now restore the file name and run again without the -ngl 0 argument:

C:\d\llama260522>llama-bench -m ..\Hy-MT2-1.8B-Q4_K_M.gguf
load_backend: loaded RPC backend from C:\d\llama260522\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 780M Graphics (AMD proprietary driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
load_backend: loaded Vulkan backend from C:\d\llama260522\ggml-vulkan.dll
load_backend: loaded CPU backend from C:\d\llama260522\ggml-cpu-zen4.dll
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |  99 |           pp512 |        844.50 ± 9.69 |
| hunyuan-dense 1.8B Q4_K - Medium |   1.05 GiB |     1.79 B | Vulkan     |  99 |           tg128 |         59.84 ± 0.31 |

build: 47c0eda9d (9279)

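Compared with the Vulkan run at -ngl 0, pp512 improves by about 43% (592.38 to 844.50 t/s) and tg128 by about 33% (45.02 to 59.84 t/s).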
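The relative differences between the three runs can be checked with a few lines of arithmetic. The t/s means below are copied from the three tables above. The dictionary keys are just labels I chose.

```python
RESULTS = {
    "cpu":          {"pp512": 339.36, "tg128": 45.39},
    "vulkan_ngl0":  {"pp512": 592.38, "tg128": 45.02},
    "vulkan_ngl99": {"pp512": 844.50, "tg128": 59.84},
}

def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0

baseline = RESULTS["vulkan_ngl0"]
for name in ("cpu", "vulkan_ngl99"):
    for test in ("pp512", "tg128"):
        change = percent_change(baseline[test], RESULTS[name][test])
        print(f"{name:13s} {test}: {change:+.1f}% vs vulkan_ngl0")
# cpu           pp512: -42.7% vs vulkan_ngl0
# cpu           tg128: +0.8% vs vulkan_ngl0
# vulkan_ngl99  pp512: +42.6% vs vulkan_ngl0
# vulkan_ngl99  tg128: +32.9% vs vulkan_ngl0
```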

Running a completion example

C:\d\llama260522>llama-completion --model ..\Hy-MT2-1.8B-Q4_K_M.gguf  -p "Translate the following segment into Chinese, without additional explanation:Hello" --jinja -ngl 0 -n 64 -st
0.00.078.290 I llama_completion: llama backend init
0.00.078.296 I llama_completion: load the model and apply lora adapter, if any
0.00.078.303 I common_init_result: fitting params to device memory ...
0.00.078.304 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.408.458 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.09.586.475 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.20.135.187 I llama_completion: llama threadpool init, n_threads = 8
0.20.136.500 I llama_completion: chat template is available, enabling conversation mode (disable it with -no-cnv)
0.20.136.506 W *** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
0.20.148.699 I llama_completion: chat template example:
<|hy_begin▁of▁sentence|>You are a helpful assistant<|hy_place▁holder▁no▁3|><|hy_User|>Hello<|hy_Assistant|>Hi there<|hy_place▁holder▁no▁2|><|hy_User|>How are you?<|hy_Assistant|>
0.20.148.709 I
0.20.149.456 I system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.20.149.458 I
0.20.161.695 I sampler seed: 3367966364
0.20.161.906 I sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 20, top_p = 0.800, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.700
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000, adaptive_target = -1.000, adaptive_decay = 0.900
0.20.162.115 I sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
0.20.162.118 I generate: n_ctx = 262144, n_batch = 2048, n_predict = 64, n_keep = 0
0.20.162.118 I
Translate the following segment into Chinese, without additional explanation:Hello你好 [end of text]


0.21.819.501 I common_perf_print:    sampling time =       0.64 ms
0.21.819.505 I common_perf_print:    samplers time =       0.09 ms /    17 tokens
0.21.819.506 I common_perf_print:        load time =   19767.05 ms
0.21.819.511 I common_perf_print: prompt eval time =    1611.80 ms /    15 tokens (  107.45 ms per token,     9.31 tokens per second)
0.21.819.513 I common_perf_print:        eval time =      36.48 ms /     1 runs   (   36.48 ms per token,    27.41 tokens per second)
0.21.819.514 I common_perf_print:       total time =    1684.68 ms /    16 tokens
0.21.819.515 I common_perf_print: unaccounted time =      35.77 ms /   2.1 %      (total - sampling - prompt eval - eval) / (total)
0.21.819.516 I common_perf_print:    graphs reused =          0

C:\d\llama260522>
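The per-token figures in the timing lines follow directly from the total time and the token count, which makes them easy to verify. A small sketch with a function name of my own choosing:

```python
def per_token_stats(total_ms, tokens):
    """Return (milliseconds per token, tokens per second)."""
    ms_per_token = total_ms / tokens
    tokens_per_second = 1000.0 / ms_per_token
    return ms_per_token, tokens_per_second

# Values copied from the log: prompt eval 1611.80 ms / 15 tokens, eval 36.48 ms / 1 run.
for label, total_ms, tokens in (("prompt eval", 1611.80, 15), ("eval", 36.48, 1)):
    ms, tps = per_token_stats(total_ms, tokens)
    print(f"{label}: {ms:.2f} ms per token, {tps:.2f} tokens per second")
# prompt eval: 107.45 ms per token, 9.31 tokens per second
# eval: 36.48 ms per token, 27.41 tokens per second
```

These figures come from a single 15-token prompt and one generated token, so they are not directly comparable to the pp512 and tg128 benchmark numbers.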

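Next I tested with the CLI. For some reason it exits after translating a single sentence. Reading the text in from a file behaves the same way: it translates once and exits.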

C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf

Loading model... /
C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st

Loading model...


build      : b9279-47c0eda9d
model      : Hy-MT2-1.8B-Q4_K_M.gguf
modalities : text


> 请将以下文本准确翻译为英文。

Please translate the text accurately into English.

[ Prompt: 9.9 t/s | Generation: 50.7 t/s ]

Exiting...

C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf --jinja -ngl 0 -n 64 -st

Loading model...


build      : b9279-47c0eda9d
model      : Hy-MT2-1.8B-Q4_K_M.gguf
modalities : text


> /read ..\eng.txt

Loaded text from '..\eng.txt'

> 译成中文

--- 文件:..\eng.txt ---
简要总结:Lance 是一种开放性的 Lakehouse 格式,专为 AI 工作负载设计。LanceDB 与 DuckDB Labs 合作,让您能够直接在 DuckDB SQL 中执行快速向量和混合搜索,而无需中断您的分析工作流

[ Prompt: 49.9 t/s | Generation: 43.1 t/s ]

Exiting...

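Addendum: after several attempts I found that with -ngl 10 the Vulkan backend can be used without the program exiting. Looking at the help text, I found:

-st, --single-turn run conversation for a single turn only, then exit when done
will not be interactive if first turn is predefined with --prompt
(default: false)

Removing -st makes it possible to translate multiple sentences in one session.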
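For scripted, one-sentence-at-a-time use, single-turn mode is arguably what one wants. The sketch below builds an argument list that mirrors the llama-completion command used earlier, with all flags taken from that command. Substituting other target languages into the prompt is my assumption, since only the Chinese form appears in this post, and the helper name is hypothetical.

```python
PROMPT_TEMPLATE = ("Translate the following segment into {language}, "
                   "without additional explanation:{text}")

def translate_command(text, language="Chinese",
                      model=r"..\Hy-MT2-1.8B-Q4_K_M.gguf", max_tokens=64):
    """Build a single-turn llama-completion command for one segment of text."""
    prompt = PROMPT_TEMPLATE.format(language=language, text=text)
    return ["llama-completion", "--model", model, "-p", prompt,
            "--jinja", "-ngl", "0", "-n", str(max_tokens), "-st"]

cmd = translate_command("Hello")
print(cmd[4])  # the prompt passed to -p
```

Passing the list to `subprocess.run` avoids shell quoting problems when the text itself contains quotes.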

C:\d\llama260522>llama-cli -m ..\Hy-MT2-1.8B-Q4_K_M.gguf -ngl 10

Loading model...
...
**未来趋势**

未来一年需要关注的是,针对PG19/PG20的列存储工作和TAM扩展生态是否会融合或分化。

社区方向普遍倾向于更高的可插拔性------更细粒度的TAM钩子,不需要每次都解析pgsql-hackers的修补的规划器集成点,以及真正能够反映自定义存储成本的估算机制。供应商方向(Snowflake、Databricks、Microsoft等,它们的Postgres产品底层都有定制存储层)则远离这一方向,因为他们的差异化优势存在于TAM线以下,可插拔性会稀释其护城河。

未来两年的架构思维走向将决定2028年"Postgres"的含义。我有个人偏好,你们可以猜测是什么。

目前的实际答案是:运行基准测试。这两个扩展都有足够稳定的版本可供测试。TAM时代已不再只是假设。

https://thebuild.com/blog/2026/05/20/table-access-methods-wake-up/

[ Prompt: 365.9 t/s | Generation: 32.1 t/s ]