llama编译封装了一个最小翻译模型400M

复制代码
---
tags:
- translation
- hy-mt
- quant
- 1.25bit
- sherry
- gguf
language:
- multilingual
base_model:
- AngelSlim/Hy-MT1.5-1.8B-1.25bit
---

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/logos/angelslim_logo_light.png?raw=true">
    <img alt="AngelSlim" src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/logos/angelslim_logo.png?raw=true" width=55%>
  </picture>
</p>

<h3 align="center">
Dedicated to building a more intuitive, comprehensive, and efficient LLMs compression toolkit.
</h3>

<p align="center">
  📱 <a href="https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-2bit-GGUF/resolve/main/Hy-MT-demo.apk?download=true">Android Demo</a>&nbsp;&nbsp; | &nbsp;&nbsp;
  📣 <a href="https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit">Weights</a>&nbsp;&nbsp; | &nbsp;&nbsp;
  ✒️ <a href="https://arxiv.org/abs/2601.07892">Sherry Paper (ACL 2026)</a>&nbsp;&nbsp; | &nbsp;&nbsp;
  📖 <a href="https://angelslim.readthedocs.io/">Documentation</a>&nbsp;&nbsp; | &nbsp;&nbsp;
  🤗 <a href="https://huggingface.co/AngelSlim">AngelSlim</a>&nbsp;&nbsp; | &nbsp;&nbsp;
  💬 <a href="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true">WeChat</a>
<br>
</p>

<p align="center">
  <img src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/HYMT1.5/model_scores.png?raw=true" alt="model_scores" width="80%">
  <br>
  <em>Hy-MT1.5-1.8B translation quality scores. Source: <a href="https://arxiv.org/abs/2512.24092">HY-MT1.5 Technical Report</a></em>
</p>

## 📣 Latest News
- [26/05/08] **We have released STQ1_0 kernel for 1.25-bit model** and given a PR to llama.cpp [PR #22836](https://github.com/ggml-org/llama.cpp/pull/22836) ! If you have any questions or suggestions for STQ_0, welcome to comment under the PR !🔥🔥🔥
- [26/04/29] We have released **Hy-MT1.5-1.8B-2bit (574MB)** and **Hy-MT1.5-1.8B-1.25bit (440MB)**, on-device translation models supporting 33 languages, with both weights and GGUF formats available. We also have made an [Android Demo](https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-2bit-GGUF/resolve/main/Hy-MT-demo.apk?download=true) for you to try out. We invite you to give it a spin! 🔥🔥🔥
- [26/02/09] We have released HY-1.8B-2Bit, 2-bit on-device large language model.
- [26/01/13] We have released v0.3. We support the training and deployment of Eagle3 for all-scale LLMs/VLMs/Audio models. And we released **Sherry**, the hardware-efficient 1.25-bit quantization algorithm [[Paper]](https://arxiv.org/abs/2601.07892) | [[Code]](https://github.com/Tencent/AngelSlim/tree/sherry/Sherry)

For more detailed information, please refer to [[AngelSlim]](https://github.com/Tencent/AngelSlim) and [[HY-MT]](https://github.com/Tencent-Hunyuan/HY-MT)

## 🌟 Hy-MT1.5-1.8B-1.25bit-GGUF Key Features

- **World-Class Translation Quality** Hy-MT1.5-1.8B-1.25bit is built upon the Hy-MT1.5-1.8B foundation model, a specialized translation model developed by Tencent Hunyuan Team through a holistic multi-stage training pipeline integrating MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. The base model natively supports **33 languages**, **5 dialects/minority languages**, and **1,056 translation directions**. With only 1.8B parameters, it comprehensively outperforms much larger open-source models (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial translation APIs (e.g., Microsoft Translator, Doubao Translator). For full details, please refer to the [HY-MT1.5-1.8B](https://huggingface.co/tencent/HY-MT1.5-1.8B) and [HY-MT1.5 Technical Report](https://arxiv.org/abs/2512.24092).

- **Sherry: Extreme 1.25-bit Quantization** This model employs [**Sherry**](https://arxiv.org/abs/2601.07892) (accepted at **ACL 2026**), a hardware-efficient ternary quantization framework. Sherry introduces a **3:4 fine-grained sparsity** strategy: for every 4 model weights, the 3 most important are stored in 1-bit ({-1, +1}), while the remaining 1 is zeroed out. This packs 4 weights into just 5 bits, achieving an effective **1.25-bit** width with power-of-two alignment, compressing the original 3.3GB FP16 model to just **440MB**, with minimal accuracy loss.

<p align="center">
  <img src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/HYMT1.5/Sherry.png?raw=true" alt="Sherry" width="80%">
  <br>
  <em>Sherry fine-grained sparsity: for every 4 weights, the 3 most important are stored in 1-bit, and the remaining 1 is zeroed out.</em>
</p>

- **On-Device Deployment for the Most Phones** Paired with our custom **STQ kernel** designed specifically for mobile CPUs, the 1.25-bit model achieves perfect SIMD instruction set alignment. This means even ordinary phones with limited memory can run high-quality offline translation smoothly. No internet connection required, and your data never leaves the device.

## 📈 Translation Benchmarks

Performance comparison of different model sizes on the Flores-200 Chinese-Foreign mutual translation benchmark:

<p align="center">
  <img src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/HYMT1.5/flores_model_size.png?raw=true" alt="flores_model_size" width="80%">
  <br>
  <em>Performance of different model sizes on the Flores-200 Chinese-Foreign mutual translation benchmark.</em>
</p>

## ⚡ Speed Demo

FP16 (8x speed) vs. 1.25-bit speed comparison.

<p align="center">
  <img src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/HYMT1.5/fp16vs1.25bit.gif?raw=true" alt="fp16_vs_1.25bit" width="60%">
  <br>
  <em> Demo device: Snapdragon 888, 8GB RAM.</em>
</p>

## 📱 Demo

We provide a ready-to-use Android demo APK for offline translation. The app features a **background word extraction mode** that works across any app on your phone --- browse emails, webpages, or chat messages and get instant translations without switching apps. No network required, no data collection, one-time download for permanent use.

**Download Demo:**

https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit-GGUF/resolve/main/Hy-MT-demo.apk

### Translation Demo

<p align="center">
  <img src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/HYMT1.5/app_demo.gif?raw=true" alt="app_demo" width="40%">
  <br>
  <em>Demo device: Snapdragon 865, 8GB RAM.</em>
</p>

### Background Word Extraction Mode

<p align="center">
  <img src="https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/HYMT1.5/demo2.gif?raw=true" alt="demo2" width="40%">
  <br>
  <em>Demo device: Snapdragon 7+ Gen 2, 16GB RAM.</em>
</p>

## 💻 Deployment

**❕❕ This gguf depends on our STQ kernel, which is released at [PR #22836](https://github.com/ggml-org/llama.cpp/pull/22836).**

### Clone llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp.git
```

### Enter the llama.cpp folder

```bash
cd llama.cpp
```

### Fetch and check out the PR branch

```bash
git fetch origin pull/22836/head:pr-22836-stq_0
git checkout pr-22836-stq_0
```

### Build llama.cpp

```bash
cmake -B build
cmake --build build --config Release
```

### Download the GGUF model


```bash
pip install huggingface_hub
huggingface-cli download AngelSlim/Hy-MT1.5-1.8B-1.25bit-GGUF \
    --local-dir model_zoo/Hy-MT1.5-1.8B-1.25bit-GGUF
```


### Run a completion example

The prompt format can be viewed at [HY-MT1.5-1.8B](https://huggingface.co/tencent/HY-MT1.5-1.8B)

```bash
./build/bin/llama-completion \
  --model model_zoo/Hy-MT1.5-1.8B-1.25bit-GGUF/Hy-MT1.5-1.8B-1.25bit.gguf \
  -p "Translate the following segment into Chinese, without additional explanation：Hello" \
  --jinja \
  -ngl 0 \
  -n 64 -st 
```

### Run the llama.cpp benchmark

```bash
./build/bin/llama-bench -m model_zoo/Hy-MT1.5-1.8B-1.25bit-GGUF/Hy-MT1.5-1.8B-1.25bit.gguf -ngl 0
```


## 📥 Download Links

- 1.25-bit model weights: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit
- 1.25-bit model GGUF: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit-GGUF
- 2-bit model weights: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-2bit
- 2-bit model GGUF: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-2bit-GGUF
- Demo APK: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit-GGUF/resolve/main/Hy-MT-demo.apk

## 📄 Technical Reports
- HY-MT1.5 Technical Report: https://arxiv.org/abs/2512.24092
- Sherry Paper (ACL 2026): https://arxiv.org/abs/2601.07892
- AngelSlim Technical Report: https://arxiv.org/abs/2602.21233

## 📝 License

The code for this project is open-sourced under the [License for AngelSlim](LICENSE).

## 🔗 Citation

```bibtex
@misc{huang2026sherry,
      title={Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification}, 
      author={Hong Huang and Decheng Wu and Qiangqiang Hu and Guanghua Yu and Jinhai Yang and Jianchen Zhu and Xue Liu and Dapeng Wu},
      year={2026},
      eprint={2601.07892},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.07892}, 
}

@article{angelslim2026,
  title={AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression},
  author={Hunyuan AI Infra Team},
  journal={arXiv preprint arXiv:2602.21233},
  year={2026}
}

@misc{zheng2025hymt,
      title={HY-MT1.5 Technical Report}, 
      author={Mao Zheng and Zheng Li and Tao Chen and Mingyang Song and Di Wang},
      year={2025},
      eprint={2512.24092},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.24092}, 
}
```

## 💬 Technical Discussion

* AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on [GitHub Issues](https://github.com/Tencent/AngelSlim/issues) or join our [WeChat discussion group](https://github.com/Tencent/AngelSlim/blob/main/docs/source/assets/angel_slim_wechat.png?raw=true).
Hy-MT1.5-1.8B-1.25bit-GGUF 混元翻译
支持手机
llama要用分支编译，才能加载模型推理
我编译的cpu的，推理速度跟电脑cpu，内存配置相关
可以本地离线用
有这种需求的可以下载
运行程序后，打开llama网页即可
网盘地址：通过网盘分享的文件：混元翻译离线本地版.7z
链接: https://pan.baidu.com/s/15blt94PgbLjWQZWMoB9Qnw 提取码: hebe