LLM Inference Acceleration (Part 5): An MLC-LLM Tutorial — Deploying Qwen-7B on a Phone

MLC-LLM is a high-performance, universal deployment solution that lets you run any large language model natively, through native APIs backed by compiler acceleration. The project's mission is to enable everyone to develop, optimize, and deploy AI models natively on their own devices using machine-learning compilation techniques.

Supported platforms and hardware:

|              | AMD GPU         | NVIDIA GPU      | Apple GPU | Intel GPU       |
|--------------|-----------------|-----------------|-----------|-----------------|
| Linux / Win  | ✅ Vulkan, ROCm | ✅ Vulkan, CUDA | N/A       | ✅ Vulkan       |
| macOS        | ✅ Metal (dGPU) | N/A             | ✅ Metal  | ✅ Metal (iGPU) |
| Web Browser  | ✅ WebGPU and WASM (all vendors) | | |          |
| iOS / iPadOS | ✅ Metal on Apple A-series GPU   | | |          |
| Android      | ✅ OpenCL on Adreno and Mali GPUs | | |         |

Supported models (each architecture has a prebuilt model library and an MLC implementation upstream):

| Architecture | Model variants |
|--------------|----------------|
| LLaMA | Llama-2-chat, Code Llama, Vicuna, WizardLM, WizardCoder (new), OpenOrca Platypus2, FlagAlpha Llama-2 Chinese, georgesung Llama-2 Uncensored, Alpaca, Guanaco, OpenLLaMA, Gorilla, YuLan-Chat |
| Mistral | Mistral-7B-Instruct-v0.2, NeuralHermes-2.5-Mistral-7B, OpenHermes-2.5-Mistral-7B, WizardMath-7B-V1.1 |
| GPT-NeoX | RedPajama, Dolly, Pythia, StableCode |
| GPTBigCode | StarCoder, SantaCoder, WizardCoder (old) |
| Phi | Phi-1_5, Phi-2 |
| GPT2 | GPT2 |

Environment Setup

```bash
conda create --name mlc python=3.11
conda activate mlc

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu121 mlc-ai-nightly-cu121
python -c "import mlc_chat; print(mlc_chat)"
```

Model Conversion

Model conversion takes two steps:

  1. Convert the model weights
  2. Generate the MLC chat config

The example below uses the Qwen model; Qwen2 is not supported yet.

Converting the Model Weights

```bash
mlc_chat convert_weight /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc
```

Parameter list

--CONFIG

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json, or
  2. Path to a config.json in HuggingFace format, or
  3. The name of a pre-defined model architecture.

--quantization QUANTIZATION_MODE

Available options: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq; q4f16_1 is recommended. The names follow the pattern qAfB(_id), where A is the number of bits used to store weights and B the number of bits for activations, so q4f16_1 means 4-bit weights with float16 activations.

--model-type MODEL_TYPE

Model architecture such as "llama". If not set, it is inferred from config.json

--device DEVICE

The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs if not specified.

--source SOURCE

The path to the original model weights; inferred from the config if missing.

--source-format SOURCE_FORMAT

The format of the source model weights; inferred from the config if missing.

--output OUTPUT

The output directory for the quantized model weights. params_shard_*.bin files and ndarray-cache.json will be created in this directory.
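As a quick check that convert_weight wrote what --output promises, a small sketch like the following (using the output path from the command above) counts the shards:

```python
# Sketch: verify the quantized output described above was written.
from pathlib import Path

out = Path("/home/chuan/models/qwen/Qwen-7B-Chat/mlc")
shards = sorted(out.glob("params_shard_*.bin"))
print(f"{len(shards)} weight shards")
print((out / "ndarray-cache.json").exists())  # shard metadata index
```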

Generating the MLC Chat Config

```bash
mlc_chat gen_config /home/chuan/models/qwen/Qwen-7B-Chat \
    --quantization q4f16_1 --conv-template chatml \
    -o /home/chuan/models/qwen/Qwen-7B-Chat/mlc
```

Note that the valid conv-template values are listed at github.com/mlc-ai/mlc-...

If your model's template is not included, you can define a custom one, but you will need to rebuild mlc from source.

Parameter list for gen_config

--CONFIG

It can be one of the following:

  1. Path to a HuggingFace model directory that contains a config.json, or
  2. Path to a config.json in HuggingFace format, or
  3. The name of a pre-defined model architecture.

--quantization QUANTIZATION_MODE

--model-type MODEL_TYPE

--conv-template CONV_TEMPLATE

--context-window-size CONTEXT_WINDOW_SIZE

The maximum context length, in tokens.

--output OUTPUT

Other, less important parameters are omitted here.
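gen_config writes these settings into mlc-chat-config.json in the output directory. A quick way to confirm the template and context window were recorded is a sketch like this (field names as in MLC's config format of that era):

```python
# Sketch: inspect the generated mlc-chat-config.json.
import json

with open("/home/chuan/models/qwen/Qwen-7B-Chat/mlc/mlc-chat-config.json") as f:
    cfg = json.load(f)

print(cfg["conv_template"])            # expect "chatml", as passed above
print(cfg.get("context_window_size"))  # the maximum context length
```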

Running MLC

```python
from mlc_chat import ChatModule

cm = ChatModule(model="/home/chuan/models/qwen/Qwen-7B-Chat/mlc")
print(cm.generate("hello"))
```
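ChatModule can also stream tokens as they are generated; a minimal sketch based on the mlc_chat API of that era (StreamToStdout, stats, and reset_chat come from the package's documented interface):

```python
from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

cm = ChatModule(model="/home/chuan/models/qwen/Qwen-7B-Chat/mlc")

# Stream tokens to stdout as they are produced instead of
# waiting for the whole completion.
cm.generate(
    prompt="hello",
    progress_callback=StreamToStdout(callback_interval=2),
)

print(cm.stats())  # prefill/decode speed of the last generation
cm.reset_chat()    # clear conversation history before a new dialogue
```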

Compiling the Qwen Model into an Android App

I have already built a version of the app; you are welcome to download it and try it out (a proxy may be required):

github.com/night-is-yo...

  1. First install Android Studio, then install the NDK and CMake through its SDK Manager.
  2. Set the environment variables:
```bash
export ANDROID_NDK=/home/chuan/Android/Sdk/ndk/26.2.11394342
export TVM_NDK_CC=/home/chuan/Android/Sdk/ndk/26.2.11394342/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android34-clang
export TVM_HOME=/home/chuan/github/mlc-llm/3rdparty/tvm
export JAVA_HOME=/home/chuan/tools/jdk-17.0.10
```

The Android API level mapping is given at the link below; my phone runs Android 14, so I chose API level 34 (hence aarch64-linux-android34-clang above).

developer.android.com/guide/topic...

  3. Download mlc-llm and compile the model (the much smaller --context-window-size of 768 keeps memory usage manageable on a phone):

```bash
git clone --recursive https://github.com/mlc-ai/mlc-llm/
cd ./mlc-llm/
MODEL_NAME=/home/chuan/models/qwen/Qwen-7B-Chat
QUANTIZATION=q4f16_1

mlc_chat convert_weight $MODEL_NAME --quantization $QUANTIZATION -o $MODEL_NAME/mlc
mlc_chat gen_config $MODEL_NAME --quantization $QUANTIZATION \
  --conv-template chatml --context-window-size 768 -o $MODEL_NAME/mlc
mlc_chat compile $MODEL_NAME/mlc/mlc-chat-config.json \
    --device android -o $MODEL_NAME/mlc/Qwen-7B-Chat-${QUANTIZATION}-android.tar
```
  4. Upload the model to Hugging Face:

```bash
git clone https://huggingface.co/chuan-niy/qwen_q4f16_1
cd qwen_q4f16_1
git config user.name chuan-niy
git config user.email 1500546481@qq.com

cp /home/chuan/models/qwen/Qwen-7B-Chat/mlc/* ./
git add . && git commit -m "Add qwen model weights for android"
git push origin main
```
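If you prefer the Hugging Face Hub client to the raw git workflow, a hypothetical equivalent (assuming `pip install huggingface_hub` and a prior `huggingface-cli login`) looks like this:

```python
# Sketch: upload the converted weights with huggingface_hub instead of git.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="/home/chuan/models/qwen/Qwen-7B-Chat/mlc",
    repo_id="chuan-niy/qwen_q4f16_1",  # repo from the git example above
    repo_type="model",
)
```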
  5. Build the Android library. First point the app at the uploaded model and the compiled model library:

```bash
cd ./android/library
vim ./src/main/assets/app-config.json
```
```json
{
  "model_list": [
    {
      "model_url": "https://huggingface.co/chuan-niy/qwen_q4f16_1",
      "model_lib": "qwen_q4f16_1",
      "estimated_vram_bytes": 4348727787,
      "model_id": "Qwen-7B-Chat-hf-q4f16_1"
    }
  ],
  "model_lib_path_for_prepare_libs": {
    "Qwen-7B-Chat-hf-q4f16_1": "/home/chuan/models/qwen/Qwen-7B-Chat/mlc/Qwen-7B-Chat-q4f16_1-android.tar"
  }
}
```
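In this config, model_url is where the app downloads the weights from, and model_lib_path_for_prepare_libs maps the model id to the .tar compiled in step 3. The estimated_vram_bytes figure is roughly consistent with back-of-envelope arithmetic for a 7B model at 4 bits per weight (a sketch only, not how MLC derives the number):

```python
# Rough sanity check of estimated_vram_bytes for Qwen-7B in q4f16_1.
params = 7.7e9                  # approximate parameter count of Qwen-7B
weight_bytes = params * 0.5     # 4-bit weights = 0.5 bytes per parameter
overhead = 0.5e9                # rough allowance for KV cache and runtime buffers
print(int(weight_bytes + overhead))  # ~4.35e9, near the 4348727787 above
```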

Add the following to CMakeLists.txt (these variables point CMake's JNI discovery at the local JDK):

vi CMakeLists.txt

```cmake
set(JAVA_AWT_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_JVM_LIBRARY "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")
set(JAVA_INCLUDE_PATH2 "/home/chuan/tools/jdk-17.0.10/include/linux")
set(JAVA_AWT_INCLUDE_PATH "/home/chuan/tools/jdk-17.0.10/include")
```

You also need to modify the code guarded by #ifdef TVM4J_ANDROID. The Android NDK's jni.h declares AttachCurrentThread with a JNIEnv** parameter, while the desktop JDK's declares it with void**; since this JNI glue is compiled on the Linux host against the JDK headers, both branches need the void** cast:

vi mlc-llm/3rdparty/tvm/jvm/native/src/main/native/org_apache_tvm_native_c_api.cc

```cpp
#ifdef TVM4J_ANDROID
    // Cast required when compiling against the desktop JDK's jni.h,
    // whose AttachCurrentThread takes void** rather than JNIEnv**.
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#else
    _jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr);
#endif
```

Finally, run the build:

```bash
./prepare_libs.sh
```
  6. Then open the android project in Android Studio, connect your phone for debugging, and run the app.

Screenshot: the result of running the app on the phone.
