LLM Inference Acceleration (Part 5): An MLC-LLM Tutorial for Deploying Qwen-7B on a Phone

MLC-LLM is a high-performance, universal deployment solution that lets you deploy any large language model natively, through native APIs with compiler acceleration. The project's mission is to enable everyone to develop, optimize, and deploy AI models locally on their own devices using machine-learning compilation techniques.

| | AMD GPU | NVIDIA GPU | Apple GPU | Intel GPU |
| --- | --- | --- | --- | --- |
| Linux / Win | ✅ Vulkan, ROCm | ✅ Vulkan, CUDA | N/A | ✅ Vulkan |
| macOS | ✅ Metal (dGPU) | N/A | ✅ Metal | ✅ Metal (iGPU) |
| Web Browser | ✅ WebGPU and WASM | ✅ WebGPU and WASM | ✅ WebGPU and WASM | ✅ WebGPU and WASM |
| iOS / iPadOS | N/A | N/A | ✅ Metal on Apple A-series GPU | N/A |
| Android | ✅ OpenCL on Adreno GPU | N/A | N/A | N/A |

Android additionally supports ✅ OpenCL on Mali GPUs.

Supported model families (each has a prebuilt model library and an MLC implementation):

- LLaMA: Llama-2-chat, Code Llama, Vicuna, WizardLM, WizardCoder (new), OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored, Alpaca, Guanaco, OpenLLaMA, Gorilla, YuLan-Chat
- Mistral: Mistral-7B-Instruct-v0.2, NeuralHermes-2.5-Mistral-7B, OpenHermes-2.5-Mistral-7B, WizardMath-7B-V1.1
- GPT-NeoX: RedPajama, Dolly, Pythia, StableCode
- GPTBigCode: StarCoder, SantaCoder, WizardCoder (old)
- Phi: Phi-1_5, Phi-2
- GPT2: GPT2

Environment Setup

```bash
conda create --name mlc python=3.11
conda activate mlc

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu121 mlc-ai-nightly-cu121
python -c "import mlc_chat; print(mlc_chat)"
```

Model Conversion

Model conversion consists of two steps:

  1. Convert the model weights
  2. Generate the MLC Chat configuration

The example below uses the Qwen model; Qwen2 is not supported yet.
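If you convert several models, the two steps below can be scripted. The helper here is my own illustrative sketch (not part of MLC); it only assumes `mlc_chat` is on the PATH with the subcommands shown in this tutorial:

```python
import subprocess

def conversion_cmds(model_dir: str, quant: str = "q4f16_1",
                    conv_template: str = "chatml") -> list[list[str]]:
    """Build the argv lists for the two conversion steps."""
    out = f"{model_dir}/mlc"
    return [
        # Step 1: quantize and convert the weights
        ["mlc_chat", "convert_weight", model_dir,
         "--quantization", quant, "-o", out],
        # Step 2: generate the chat config for the runtime
        ["mlc_chat", "gen_config", model_dir,
         "--quantization", quant, "--conv-template", conv_template,
         "-o", out],
    ]

def run_conversion(model_dir: str) -> None:
    for cmd in conversion_cmds(model_dir):
        subprocess.run(cmd, check=True)
```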

Convert the model weights

```bash
mlc_chat convert_weight /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc
```

Parameter list

--CONFIG

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json or
  2. Path to config.json in HuggingFace format, or
  3. The name of a pre-defined model architecture.

--quantization QUANTIZATION_MODE

Options: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq. q4f16_1 is recommended.

--model-type MODEL_TYPE

Model architecture such as "llama". If not set, it is inferred from config.json

--device DEVICE

The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs if not specified.

--source SOURCE

The path to original model weight, infer from config if missing.

--source-format SOURCE_FORMAT

The format of source model weight, infer from config if missing.

--output OUTPUT

The output directory to save the quantized model weight. Will create params_shard_*.bin and ndarray-cache.json in this directory.
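As a sanity check on the output: q4f16_1 stores weights at roughly 4 bits each, so a 7B-parameter model should produce about 7e9 × 0.5 B ≈ 3.5 GB of shards. The helper below is my own sketch (not part of MLC) that compares this estimate against the actual `params_shard_*.bin` files:

```python
import glob
import os

def estimate_q4_bytes(n_params: float) -> float:
    """Rough q4f16_1 weight size: ~4 bits (0.5 bytes) per parameter."""
    return n_params * 0.5

def shard_bytes(output_dir: str) -> int:
    """Total size of the weight shards written by convert_weight."""
    return sum(os.path.getsize(p)
               for p in glob.glob(os.path.join(output_dir, "params_shard_*.bin")))

# For Qwen-7B this predicts roughly 3.5 GB of shards
print(f"{estimate_q4_bytes(7e9) / 1e9:.1f} GB")
```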

Generate the MLC Chat configuration

```bash
mlc_chat gen_config /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 --conv-template chatml \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc
```

Note: for the valid conv-template values, see github.com/mlc-ai/mlc-...

If your model's template is not included, you can define a custom one, but then MLC must be recompiled from source.

Parameter list for gen_config

--CONFIG

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json or
  2. Path to config.json in HuggingFace format, or
  3. The name of a pre-defined model architecture.

--quantization QUANTIZATION_MODE

--model-type MODEL_TYPE

--conv-template CONV_TEMPLATE

--context-window-size CONTEXT_WINDOW_SIZE

Maximum context length, in tokens

--output OUTPUT

Other, less important parameters are omitted here.
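Why does --context-window-size matter on a phone? The fp16 KV cache grows linearly with context length. Assuming the usual Qwen-7B shape (32 layers, 32 KV heads of head dim 128; these numbers are my assumption, taken from the model card rather than this tutorial), each token costs about 0.5 MiB:

```python
def kv_cache_bytes(context_len: int, n_layers: int = 32,
                   n_kv_heads: int = 32, head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """fp16 KV-cache size: K and V tensors per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

# A context of 768 tokens adds roughly 0.4 GiB on top of the
# ~3.5 GB of q4f16_1 weights, which is why the Android build
# later in this tutorial uses --context-window-size 768
print(f"{kv_cache_bytes(768) / 2**30:.2f} GiB")
```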

Running MLC

```python
from mlc_chat import ChatModule

# Point at the directory holding the converted weights and mlc-chat-config.json
cm = ChatModule(model="/home/chuan/models/qwen/Qwen-7B-Chat/mlc")
print(cm.generate("hello"))
```

Compiling the Qwen model into an Android app

I have already built a version of the app; feel free to download and try it (a VPN or proxy may be required in mainland China):

github.com/night-is-yo...

  1. First, install Android Studio and download the NDK and CMake, as shown in the screenshot:
  2. Set the environment variables:
```bash
export ANDROID_NDK=/home/chuan/Android/Sdk/ndk/26.2.11394342
export TVM_NDK_CC=/home/chuan/Android/Sdk/ndk/26.2.11394342/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android34-clang
export TVM_HOME=/home/chuan/github/mlc-llm/3rdparty/tvm
export JAVA_HOME=/home/chuan/tools/jdk-17.0.10
```

See the Android version to API level mapping table; my phone runs Android 14, so I chose API level 34:

developer.android.com/guide/topic...
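The release-to-API-level mapping determines which NDK clang wrapper TVM_NDK_CC should point at. A small lookup (levels taken from the Android documentation; the helper itself is my own sketch) builds the binary name:

```python
# Android release -> API level, for recent releases
API_LEVEL = {
    "14": 34,
    "13": 33,
    "12L": 32,
    "12": 31,
    "11": 30,
    "10": 29,
}

def ndk_clang(api_level: int, arch: str = "aarch64") -> str:
    """Name of the NDK clang wrapper targeting a given API level."""
    return f"{arch}-linux-android{api_level}-clang"

print(ndk_clang(API_LEVEL["14"]))  # -> aarch64-linux-android34-clang
```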

  3. Download MLC and compile the model:
```bash
git clone --recursive https://github.com/mlc-ai/mlc-llm/
cd ./mlc-llm/
MODEL_NAME=/home/chuan/models/qwen/Qwen-7B-Chat
QUANTIZATION=q4f16_1

mlc_chat convert_weight $MODEL_NAME --quantization $QUANTIZATION -o $MODEL_NAME/mlc
mlc_chat gen_config $MODEL_NAME --quantization $QUANTIZATION \
  --conv-template chatml --context-window-size 768 -o $MODEL_NAME/mlc
mlc_chat compile $MODEL_NAME/mlc/mlc-chat-config.json \
    --device android -o $MODEL_NAME/mlc/Qwen-7B-Chat-${QUANTIZATION}-android.tar
```
  4. Upload the model to Hugging Face:
```bash
git clone https://huggingface.co/chuan-niy/qwen_q4f16_1
cd qwen_q4f16_1
git config user.name chuan-niy
git config user.email 1500546481@qq.com

cp /home/chuan/models/qwen/Qwen-7B-Chat/mlc/* ./
git add . && git commit -m "Add qwen model weights for android"
git push origin main
```
  5. Build the Android library:
```bash
cd ./android/library
vim ./src/main/assets/app-config.json
```
```json
{
  "model_list": [
    {
      "model_url": "https://huggingface.co/chuan-niy/qwen_q4f16_1",
      "model_lib": "qwen_q4f16_1",
      "estimated_vram_bytes": 4348727787,
      "model_id": "Qwen-7B-Chat-hf-q4f16_1"
    }
  ],
  "model_lib_path_for_prepare_libs": {
    "Qwen-7B-Chat-hf-q4f16_1": "/home/chuan/models/qwen/Qwen-7B-Chat/mlc/Qwen-7B-Chat-q4f16_1-android.tar"
  }
}
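A common mistake in app-config.json is a model_id that does not match any key in model_lib_path_for_prepare_libs, or a missing field. A quick validation sketch (my own helper, assuming only the config shape shown above):

```python
import json

def check_app_config(text: str) -> list[str]:
    """Return a list of problems found in an app-config.json string."""
    cfg = json.loads(text)
    problems = []
    lib_paths = cfg.get("model_lib_path_for_prepare_libs", {})
    for m in cfg.get("model_list", []):
        # Every entry needs these four fields
        for key in ("model_url", "model_lib", "estimated_vram_bytes", "model_id"):
            if key not in m:
                problems.append(f"{m.get('model_id', '?')}: missing {key}")
        # The model_id must map to a compiled .tar for prepare_libs.sh
        if m.get("model_id") not in lib_paths:
            problems.append(f"{m.get('model_id')}: no entry in "
                            "model_lib_path_for_prepare_libs")
    return problems
```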

Add the following to CMakeLists.txt:

vi CMakeLists.txt

```cmake
set(JAVA_AWT_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_JVM_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")
set(JAVA_INCLUDE_PATH2 "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_AWT_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")
```

You also need to modify the code at the #ifdef TVM4J_ANDROID block:

vi mlc-llm/3rdparty/tvm/jvm/native/src/main/native/org_apache_tvm_native_c_api.cc

```cpp
#ifdef TVM4J_ANDROID
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#else
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#endif
```

Finally, run the build:

```bash
./prepare_libs.sh
```
  6. Then open the android project in Android Studio, connect your phone for debugging, and run the app.

Running it on the phone gives the following result:
