LLM Inference Acceleration (Part 5): An MLC-LLM Tutorial for Deploying Qwen-7B on a Phone

MLC-LLM is a high-performance, universal deployment solution that lets you deploy any large language model natively, through native APIs with compiler acceleration. The project's mission is to enable everyone to develop, optimize, and deploy AI models locally on their own devices using machine-learning compilation techniques.

| | AMD GPU | NVIDIA GPU | Apple GPU | Intel GPU |
| --- | --- | --- | --- | --- |
| Linux / Win | ✅ Vulkan, ROCm | ✅ Vulkan, CUDA | N/A | ✅ Vulkan |
| macOS | ✅ Metal (dGPU) | N/A | ✅ Metal | ✅ Metal (iGPU) |
| Web Browser | ✅ WebGPU and WASM | ✅ WebGPU and WASM | ✅ WebGPU and WASM | ✅ WebGPU and WASM |
| iOS / iPadOS | N/A | N/A | ✅ Metal on Apple A-series GPU | N/A |
| Android | ✅ OpenCL on Adreno GPU | N/A | N/A | N/A |

Android additionally supports ✅ OpenCL on Mali GPUs.

Supported model families (each has a prebuilt model library and an MLC implementation):

- LLaMA: Llama-2-chat, Code Llama, Vicuna, WizardLM, WizardCoder (new), OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored, Alpaca, Guanaco, OpenLLaMA, Gorilla, YuLan-Chat
- Mistral: Mistral-7B-Instruct-v0.2, NeuralHermes-2.5-Mistral-7B, OpenHermes-2.5-Mistral-7B, WizardMath-7B-V1.1
- GPT-NeoX: RedPajama, Dolly, Pythia, StableCode
- GPTBigCode: StarCoder, SantaCoder, WizardCoder (old)
- Phi: Phi-1_5, Phi-2
- GPT2: GPT2

Environment Setup

```bash
conda create --name mlc python=3.11
conda activate mlc

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu121 mlc-ai-nightly-cu121
python -c "import mlc_chat; print(mlc_chat)"
```

Model Conversion

Model conversion consists of two steps:

  1. Convert the model weights
  2. Generate the MLC Chat configuration

The example below uses the Qwen model; Qwen2 is not supported yet.
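If you convert several models, the two steps below can be scripted. The helper here is my own illustrative sketch (not part of MLC); it only assumes `mlc_chat` is on the PATH with the subcommands shown in this tutorial:

```python
import subprocess

def conversion_cmds(model_dir: str, quant: str = "q4f16_1",
                    conv_template: str = "chatml") -> list[list[str]]:
    """Build the argv lists for the two conversion steps."""
    out = f"{model_dir}/mlc"
    return [
        # Step 1: quantize and convert the weights
        ["mlc_chat", "convert_weight", model_dir,
         "--quantization", quant, "-o", out],
        # Step 2: generate the chat config for the runtime
        ["mlc_chat", "gen_config", model_dir,
         "--quantization", quant, "--conv-template", conv_template,
         "-o", out],
    ]

def run_conversion(model_dir: str) -> None:
    for cmd in conversion_cmds(model_dir):
        subprocess.run(cmd, check=True)
```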

Convert the model weights

```bash
mlc_chat convert_weight /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc
```

Parameter list

--CONFIG

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json or
  2. Path to config.json in HuggingFace format, or
  3. The name of a pre-defined model architecture.

--quantization QUANTIZATION_MODE

Options: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq. q4f16_1 is recommended.

--model-type MODEL_TYPE

Model architecture such as "llama". If not set, it is inferred from config.json

--device DEVICE

The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs if not specified.

--source SOURCE

The path to original model weight, infer from config if missing.

--source-format SOURCE_FORMAT

The format of source model weight, infer from config if missing.

--output OUTPUT

The output directory to save the quantized model weight. Will create params_shard_*.bin and ndarray-cache.json in this directory.
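As a sanity check on the output: q4f16_1 stores weights at roughly 4 bits each, so a 7B-parameter model should produce about 7e9 × 0.5 B ≈ 3.5 GB of shards. The helper below is my own sketch (not part of MLC) that compares this estimate against the actual `params_shard_*.bin` files:

```python
import glob
import os

def estimate_q4_bytes(n_params: float) -> float:
    """Rough q4f16_1 weight size: ~4 bits (0.5 bytes) per parameter."""
    return n_params * 0.5

def shard_bytes(output_dir: str) -> int:
    """Total size of the weight shards written by convert_weight."""
    return sum(os.path.getsize(p)
               for p in glob.glob(os.path.join(output_dir, "params_shard_*.bin")))

# For Qwen-7B this predicts roughly 3.5 GB of shards
print(f"{estimate_q4_bytes(7e9) / 1e9:.1f} GB")
```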

Generate the MLC Chat configuration

```bash
mlc_chat gen_config /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 --conv-template chatml \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc
```

Note: for the valid conv-template values, see github.com/mlc-ai/mlc-...

If your model's template is not included, you can define a custom one, but then MLC must be recompiled from source.

Parameter list for gen_config

--CONFIG

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json or
  2. Path to config.json in HuggingFace format, or
  3. The name of a pre-defined model architecture.

--quantization QUANTIZATION_MODE

--model-type MODEL_TYPE

--conv-template CONV_TEMPLATE

--context-window-size CONTEXT_WINDOW_SIZE

Maximum context length, in tokens

--output OUTPUT

Other, less important parameters are omitted here.
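Why does --context-window-size matter on a phone? The fp16 KV cache grows linearly with context length. Assuming the usual Qwen-7B shape (32 layers, 32 KV heads of head dim 128; these numbers are my assumption, taken from the model card rather than this tutorial), each token costs about 0.5 MiB:

```python
def kv_cache_bytes(context_len: int, n_layers: int = 32,
                   n_kv_heads: int = 32, head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """fp16 KV-cache size: K and V tensors per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

# A context of 768 tokens adds roughly 0.4 GiB on top of the
# ~3.5 GB of q4f16_1 weights, which is why the Android build
# later in this tutorial uses --context-window-size 768
print(f"{kv_cache_bytes(768) / 2**30:.2f} GiB")
```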

Running MLC

```python
from mlc_chat import ChatModule

# Point at the directory holding the converted weights and mlc-chat-config.json
cm = ChatModule(model="/home/chuan/models/qwen/Qwen-7B-Chat/mlc")
print(cm.generate("hello"))
```

Compiling the Qwen model into an Android app

I have already built a version of the app; feel free to download and try it (a VPN or proxy may be required in mainland China):

github.com/night-is-yo...

  1. First, install Android Studio and download the NDK and CMake, as shown in the screenshot:
  2. Set the environment variables:
```bash
export ANDROID_NDK=/home/chuan/Android/Sdk/ndk/26.2.11394342
export TVM_NDK_CC=/home/chuan/Android/Sdk/ndk/26.2.11394342/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android34-clang
export TVM_HOME=/home/chuan/github/mlc-llm/3rdparty/tvm
export JAVA_HOME=/home/chuan/tools/jdk-17.0.10
```

See the Android version to API level mapping table; my phone runs Android 14, so I chose API level 34:

developer.android.com/guide/topic...
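The release-to-API-level mapping determines which NDK clang wrapper TVM_NDK_CC should point at. A small lookup (levels taken from the Android documentation; the helper itself is my own sketch) builds the binary name:

```python
# Android release -> API level, for recent releases
API_LEVEL = {
    "14": 34,
    "13": 33,
    "12L": 32,
    "12": 31,
    "11": 30,
    "10": 29,
}

def ndk_clang(api_level: int, arch: str = "aarch64") -> str:
    """Name of the NDK clang wrapper targeting a given API level."""
    return f"{arch}-linux-android{api_level}-clang"

print(ndk_clang(API_LEVEL["14"]))  # -> aarch64-linux-android34-clang
```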

  3. Download MLC and compile the model:
```bash
git clone --recursive https://github.com/mlc-ai/mlc-llm/
cd ./mlc-llm/
MODEL_NAME=/home/chuan/models/qwen/Qwen-7B-Chat
QUANTIZATION=q4f16_1

mlc_chat convert_weight $MODEL_NAME --quantization $QUANTIZATION -o $MODEL_NAME/mlc
mlc_chat gen_config $MODEL_NAME --quantization $QUANTIZATION \
  --conv-template chatml --context-window-size 768 -o $MODEL_NAME/mlc
mlc_chat compile $MODEL_NAME/mlc/mlc-chat-config.json \
    --device android -o $MODEL_NAME/mlc/Qwen-7B-Chat-${QUANTIZATION}-android.tar
```
  4. Upload the model to Hugging Face:
```bash
git clone https://huggingface.co/chuan-niy/qwen_q4f16_1
cd qwen_q4f16_1
git config user.name chuan-niy
git config user.email 1500546481@qq.com

cp /home/chuan/models/qwen/Qwen-7B-Chat/mlc/* ./
git add . && git commit -m "Add qwen model weights for android"
git push origin main
```
  5. Build the Android library:
```bash
cd ./android/library
vim ./src/main/assets/app-config.json
```
```json
{
  "model_list": [
    {
      "model_url": "https://huggingface.co/chuan-niy/qwen_q4f16_1",
      "model_lib": "qwen_q4f16_1",
      "estimated_vram_bytes": 4348727787,
      "model_id": "Qwen-7B-Chat-hf-q4f16_1"
    }
  ],
  "model_lib_path_for_prepare_libs": {
    "Qwen-7B-Chat-hf-q4f16_1": "/home/chuan/models/qwen/Qwen-7B-Chat/mlc/Qwen-7B-Chat-q4f16_1-android.tar"
  }
}
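A common mistake in app-config.json is a model_id that does not match any key in model_lib_path_for_prepare_libs, or a missing field. A quick validation sketch (my own helper, assuming only the config shape shown above):

```python
import json

def check_app_config(text: str) -> list[str]:
    """Return a list of problems found in an app-config.json string."""
    cfg = json.loads(text)
    problems = []
    lib_paths = cfg.get("model_lib_path_for_prepare_libs", {})
    for m in cfg.get("model_list", []):
        # Every entry needs these four fields
        for key in ("model_url", "model_lib", "estimated_vram_bytes", "model_id"):
            if key not in m:
                problems.append(f"{m.get('model_id', '?')}: missing {key}")
        # The model_id must map to a compiled .tar for prepare_libs.sh
        if m.get("model_id") not in lib_paths:
            problems.append(f"{m.get('model_id')}: no entry in "
                            "model_lib_path_for_prepare_libs")
    return problems
```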

Add the following to CMakeLists.txt:

vi CMakeLists.txt

```cmake
set(JAVA_AWT_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_JVM_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")
set(JAVA_INCLUDE_PATH2 "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_AWT_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")
```

You also need to modify the code at the #ifdef TVM4J_ANDROID block:

vi mlc-llm/3rdparty/tvm/jvm/native/src/main/native/org_apache_tvm_native_c_api.cc

```cpp
#ifdef TVM4J_ANDROID
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#else
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#endif
```

Finally, run the build:

```bash
./prepare_libs.sh
```
  6. Then open the android project in Android Studio, connect your phone for debugging, and run the app.

Running it on the phone gives the following result:
