LLM Inference Acceleration (5): MLC-LLM Tutorial — Deploying Qwen-7B on a Phone

MLC-LLM is a high-performance, universal deployment solution that lets you deploy any large language model natively, through native APIs with compiler acceleration. The project's mission is to enable everyone to develop, optimize, and deploy AI models natively on their own devices using machine-learning compilation techniques.

|              | AMD GPU         | NVIDIA GPU      | Apple GPU | Intel GPU       |
| ------------ | --------------- | --------------- | --------- | --------------- |
| Linux / Win  | ✅ Vulkan, ROCm  | ✅ Vulkan, CUDA  | N/A       | ✅ Vulkan        |
| macOS        | ✅ Metal (dGPU)  | N/A             | ✅ Metal   | ✅ Metal (iGPU)  |
| Web Browser  | ✅ WebGPU and WASM | | | |
| iOS / iPadOS | ✅ Metal on Apple A-series GPU | | | |
| Android      | ✅ OpenCL on Adreno GPU | ✅ OpenCL on Mali GPU | | |

List of supported models:

| Architecture | Supported models |
| --- | --- |
| LLaMA | Llama-2-chat, Code Llama, Vicuna, WizardLM, WizardCoder (new), OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored, Alpaca, Guanaco, OpenLLaMA, Gorilla, YuLan-Chat |
| Mistral | Mistral-7B-Instruct-v0.2, NeuralHermes-2.5-Mistral-7B, OpenHermes-2.5-Mistral-7B, WizardMath-7B-V1.1 |
| GPT-NeoX | RedPajama, Dolly, Pythia, StableCode |
| GPTBigCode | StarCoder, SantaCoder, WizardCoder (old) |
| Phi | Phi-1_5, Phi-2 |
| GPT2 | GPT2 |

(Each architecture also has a prebuilt model library and an MLC implementation linked upstream.)

Environment Setup

bash
conda create --name mlc python=3.11
conda activate mlc

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu121 mlc-ai-nightly-cu121
python -c "import mlc_chat; print(mlc_chat)"

Model Conversion

Model conversion takes two steps:

  1. Convert the model weights
  2. Generate the MLC chat configuration

The example below uses the Qwen model; Qwen2 is not supported yet.

Converting the model weights

bash
mlc_chat convert_weight /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc

Parameter list

CONFIG (positional)

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json or
  2. Path to config.json in HuggingFace format, or
  3. The name of a pre-defined model architecture.

--quantization QUANTIZATION_MODE

Options: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq. q4f16_1 is recommended.
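To get a feel for what these modes buy you: in the qAfB(_id) naming, q4 means 4-bit weight storage and f16 means float16 activations/compute. A back-of-the-envelope sketch of my own (ignoring quantization scales and any layers left unquantized, so the real artifact is somewhat larger):

```python
# Rough weight-storage estimate for a quantization mode like q4f16_1:
# "q4" = 4-bit weights, "f16" = fp16 activations and scales.
def quantized_weight_gib(n_params: float, weight_bits: int) -> float:
    """Approximate weight storage in GiB, ignoring scales and metadata."""
    return n_params * weight_bits / 8 / 1024**3

print(f"{quantized_weight_gib(7e9, 4):.2f} GiB")   # Qwen-7B at 4-bit: ~3.26 GiB
print(f"{quantized_weight_gib(7e9, 16):.2f} GiB")  # unquantized fp16: ~13.04 GiB
```

This is why q4f16_1 makes a 7B model feasible on phone-class memory budgets.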

--model-type MODEL_TYPE

Model architecture such as "llama". If not set, it is inferred from config.json

--device DEVICE

The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs if not specified.

--source SOURCE

The path to the original model weights; inferred from the config if missing.

--source-format SOURCE_FORMAT

The format of the source model weights; inferred from the config if missing.

--output OUTPUT

The output directory to save the quantized model weight. Will create params_shard_*.bin and ndarray-cache.json in this directory.

Generating the MLC chat configuration

bash
mlc_chat gen_config /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 --conv-template chatml \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc

Note: for the valid conv-template values, see github.com/mlc-ai/mlc-...

If your model's template is not included, you can define a custom one, but that requires rebuilding mlc from source.

Parameter list for gen_config

CONFIG (positional)

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json or
  2. Path to config.json in HuggingFace format, or
  3. The name of a pre-defined model architecture.

--quantization QUANTIZATION_MODE

--model-type MODEL_TYPE

--conv-template CONV_TEMPLATE

--context-window-size CONTEXT_WINDOW_SIZE

Maximum context window size (i.e. maximum sequence length)

--output OUTPUT

Other, less important parameters are not listed here.
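For orientation, gen_config writes an mlc-chat-config.json into the output directory, which the runtime reads later. A trimmed sketch of the kind of fields it contains (the values below are illustrative, not copied from a real run):

```json
{
  "model_type": "qwen",
  "quantization": "q4f16_1",
  "conv_template": "chatml",
  "context_window_size": 8192,
  "temperature": 0.7,
  "top_p": 0.95
}
```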

Running MLC

python
from mlc_chat import ChatModule

cm = ChatModule(model="/home/chuan/models/qwen/Qwen-7B-Chat/mlc")
print(cm.generate("hello"))

Compiling the Qwen model into an Android app

I have already built a version of the app; you are welcome to download and try it (access may require a VPN):

github.com/night-is-yo...

  1. Install Android Studio, and download the NDK and CMake.
  2. Set the environment variables:
bash
export ANDROID_NDK=/home/chuan/Android/Sdk/ndk/26.2.11394342
export TVM_NDK_CC=/home/chuan/Android/Sdk/ndk/26.2.11394342/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android34-clang
export TVM_HOME=/home/chuan/github/mlc-llm/3rdparty/tvm
export JAVA_HOME=/home/chuan/tools/jdk-17.0.10

The Android API level table is linked below; since my phone runs Android 14, I chose API level 34:

developer.android.com/guide/topic...

  3. Clone mlc-llm and compile the model:
bash
git clone --recursive https://github.com/mlc-ai/mlc-llm/
cd ./mlc-llm/
MODEL_NAME=/home/chuan/models/qwen/Qwen-7B-Chat
QUANTIZATION=q4f16_1

mlc_chat convert_weight $MODEL_NAME --quantization $QUANTIZATION -o $MODEL_NAME/mlc
mlc_chat gen_config $MODEL_NAME --quantization $QUANTIZATION \
  --conv-template chatml --context-window-size 768 -o $MODEL_NAME/mlc
mlc_chat compile $MODEL_NAME/mlc/mlc-chat-config.json \
    --device android -o $MODEL_NAME/mlc/Qwen-7B-Chat-${QUANTIZATION}-android.tar
  4. Upload the model to Hugging Face:
bash
git clone https://huggingface.co/chuan-niy/qwen_q4f16_1
cd qwen_q4f16_1
git config user.name chuan-niy
git config user.email 1500546481@qq.com

cp /home/chuan/models/qwen/Qwen-7B-Chat/mlc/* ./
git add . && git commit -m "Add qwen model weights for android"
git push origin main
  5. Build the Android library:
bash
cd ./android/library
vim ./src/main/assets/app-config.json
json
{
  "model_list": [
    {
      "model_url": "https://huggingface.co/chuan-niy/qwen_q4f16_1",
      "model_lib": "qwen_q4f16_1",
      "estimated_vram_bytes": 4348727787,
      "model_id": "Qwen-7B-Chat-hf-q4f16_1"
    }
  ],
  "model_lib_path_for_prepare_libs": {
    "Qwen-7B-Chat-hf-q4f16_1": "/home/chuan/models/qwen/Qwen-7B-Chat/mlc/Qwen-7B-Chat-q4f16_1-android.tar"
  }
}
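The estimated_vram_bytes value above can be sanity-checked with a rough calculation of my own (illustrative only; MLC's figure also covers runtime workspace): 4-bit weights for 7B parameters, plus an fp16 KV cache for the 768-token context window compiled earlier, assuming Qwen-7B's published shape of 32 layers and hidden size 4096.

```python
# Rough VRAM estimate for Qwen-7B (q4f16_1) with a 768-token context window.
# Assumes Qwen-7B's shape: 32 layers, hidden size 4096, fp16 KV cache.
weight_bytes = int(7e9 * 4 / 8)            # 4-bit weights: 3.5 GB
kv_cache_bytes = 2 * 32 * 768 * 4096 * 2   # K and V, per layer, per token, fp16
total = weight_bytes + kv_cache_bytes
print(total)  # ~3.9 GB, in the same ballpark as estimated_vram_bytes above
```

The gap between this and the declared 4,348,727,787 bytes is plausibly runtime buffers and activation workspace.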

Add the following to CMakeLists.txt:

vi CMakeLists.txt

cmake
set(JAVA_AWT_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_JVM_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")
set(JAVA_INCLUDE_PATH2 "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_AWT_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")

You also need to modify the code around #ifdef TVM4J_ANDROID:

vi mlc-llm/3rdparty/tvm/jvm/native/src/main/native/org_apache_tvm_native_c_api.cc

cpp
#ifdef TVM4J_ANDROID
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#else
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#endif

Finally, run the build:

bash
./prepare_libs.sh
  6. Open the android project in Android Studio, connect your phone for debugging, and run the app.

Running it on the phone produces the following result:
