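Running Small Language Models on the GPU with OpenCL and the MLC-LLM Framework: D-Robotics RDK X5 Development Board (Non-Production Algorithm, Just for Fun)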

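RDK™ X5 robot development kit: the D-Robotics RDK X5 is built around the Sunrise 5 intelligent computing chip and delivers up to 10 TOPS of compute. It is an all-round development kit for intelligent computing and robotics applications, with a rich set of interfaces and a strong focus on ease of use.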

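This article uses the board's small 32 GFLOPS GPU, which supports OpenCL, together with the MLC-LLM framework to run small language models. Qwen2.5-0.5B reaches about 0.5 tok/s, and SmolLM-135M about 2 tok/s.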

Note: the core component of the RDK X5 is its 10 TOPS BPU, not the GPU. This article only plays around with the GPU and is not a production algorithm.

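Summary

0% BPU usage and almost 0% CPU usage, so the main compute resources stay available for the primary algorithms.
int4 quantization is used, which in theory needs half the memory and bandwidth of int8 quantization; 4.1 GB of memory remains free.
The GPU is the compute unit, which gives a small taste of general-purpose computing. In theory, any model on HuggingFace can be run on the X5's OpenCL-capable GPU.
All code repositories and model weights involved are open source under the Apache 2.0 license.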
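
The "half of int8" figure in the summary is plain arithmetic, and the sketch below works it out. The parameter count is rounded to 0.5e9, and the grouped variant (one 4-byte scale per 32 weights) is an assumption about how a q4f32_1-style format stores scales, not a number taken from this article.

```python
def weight_bytes(n_params: int, bits: int, group_size: int = 0, scale_bytes: int = 0) -> int:
    """Estimate storage for quantized weights.

    n_params    : number of weights
    bits        : bits per quantized weight (4 for int4, 8 for int8)
    group_size  : number of weights sharing one scale (0 = ignore scales)
    scale_bytes : bytes used by each scale value
    """
    packed = n_params * bits // 8
    scales = (n_params // group_size) * scale_bytes if group_size else 0
    return packed + scales


N = 500_000_000  # assumption: Qwen2.5-0.5B rounded to 0.5e9 weights

int8 = weight_bytes(N, 8)
int4 = weight_bytes(N, 4)
# assumption: one float32 scale per group of 32 weights
int4_grouped = weight_bytes(N, 4, group_size=32, scale_bytes=4)

print(f"int8            : {int8 / 2**20:.0f} MiB")
print(f"int4            : {int4 / 2**20:.0f} MiB")
print(f"int4 + f32 scale: {int4_grouped / 2**20:.0f} MiB")
```

The packed int4 weights are exactly half the int8 size; per-group scales add some overhead on top, so the real footprint sits somewhat above the ideal half.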

Results

Bilibili:

https://www.bilibili.com/video/BV1Q428YfEKD

Qwen2.5-0.5B-Instruct-q4f32_1-MLC:

Settings adjusted with /set temperature=0.5;top_p=0.8;seed=23;max_tokens=60;
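
As a rough guide to how long a reply can take with these settings, the sketch below parses the settings string and divides max_tokens by the approximate 0.5 tok/s quoted earlier for Qwen2.5-0.5B. The helper name `parse_settings` is hypothetical, and the estimate ignores prompt prefill time.

```python
def parse_settings(line: str) -> dict:
    """Parse 'key=value;key=value;' into a dict with int or float values."""
    out = {}
    for item in line.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing ';'
        key, value = item.split("=", 1)
        number = float(value)
        out[key.strip()] = int(number) if number.is_integer() else number
    return out


settings = parse_settings("temperature=0.5;top_p=0.8;seed=23;max_tokens=60;")
TOK_PER_S = 0.5  # approximate Qwen2.5-0.5B decode speed quoted in this article

print(settings)
print(f"worst-case decode time: {settings['max_tokens'] / TOK_PER_S:.0f} s")
```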

SmolLM-135M-Instruct-q4f32_1-MLC

RDK X5

No long introduction needed: the most powerful robot development kit ever offered for under 1,000 RMB.

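MLC-LLM

MLC-LLM acts as both the toolchain and the inference library. This article relies on the OpenCL support for Mali GPUs that comes with its Android build target.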

OpenCL

OpenCL is Widely Deployed and Used

OpenCL for Low-level Parallel Programming: OpenCL speeds applications by offloading their most computationally intensive code onto accelerator processors, or devices. OpenCL developers use C or C++ based kernel languages to code programs that are passed through a device compiler for parallel execution on accelerator devices.

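Reference Steps

Note: some Linux experience is required. Check every file name and path carefully. If you run into errors such as No such file or directory, No module named "xxx", or command not found, review your setup carefully. Do not copy and run the commands blindly one by one.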

Tune the RDK X5 for best performance

Overclock all cores to 1.8 GHz and keep the performance governor enabled throughout.

sudo bash -c "echo 1 > /sys/devices/system/cpu/cpufreq/boost"
sudo bash -c "echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor"
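
To confirm that the two writes above took effect, the values can be read back from the same sysfs directory. This is a minimal sketch; `read_cpufreq` is a hypothetical helper, and it assumes each policy directory exposes a `scaling_governor` file in the same way `policy0` does.

```python
from pathlib import Path


def read_cpufreq(root: str = "/sys/devices/system/cpu/cpufreq") -> dict:
    """Return the boost flag and the governor of every cpufreq policy under root."""
    base = Path(root)
    result = {"boost": None, "governors": {}}
    if not base.is_dir():
        return result  # cpufreq is not exposed on this machine
    boost_file = base / "boost"
    if boost_file.is_file():
        result["boost"] = boost_file.read_text().strip()
    for gov_file in sorted(base.glob("policy*/scaling_governor")):
        result["governors"][gov_file.parent.name] = gov_file.read_text().strip()
    return result


if __name__ == "__main__":
    print(read_cpufreq())
```

On the board, a boost value of 1 and a governor of performance for every policy would indicate the settings are active. They are not persistent, so they need to be reapplied after a reboot.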

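Remove packages that are not needed for now to save storage and memory. After a reboot, this leaves an RDK X5 environment that uses about 5.5 GB of storage and about 240 MB of memory.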

sudo apt remove '*xfce*'
sudo apt remove hobot-io-samples hobot-multimedia-samples hobot-models-basic
sudo apt autoremove

Enable 8 GB of swap space

sudo fallocate -l 8G /swapfile  # create an 8 GB file to use as swap
sudo chmod 600 /swapfile        # prevent regular users from reading it
sudo mkswap /swapfile           # set up a Linux swap area on the file
sudo swapon /swapfile           # activate the swap area
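
A quick way to confirm the swap area is active is to read SwapTotal from /proc/meminfo. The sketch below parses that file; the helper names are hypothetical.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field name to its numeric value (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_gib(text: str) -> float:
    """Total swap in GiB, or 0.0 when no SwapTotal field is present."""
    return parse_meminfo(text).get("SwapTotal", 0) / 2**20


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.is_file():
        print(f"SwapTotal: {swap_gib(meminfo.read_text()):.1f} GiB")
```

After the commands above, the reported value should be close to 8 GiB. The swap file is not re-enabled automatically after a reboot unless it is also registered in /etc/fstab.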

Install Conda (optional)

wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py310_24.7.1-0-Linux-aarch64.sh
bash Miniconda3-py310_24.7.1-0-Linux-aarch64.sh

# During installation, answer the prompts in this order:
# Enter, q, Enter, yes

Install the required apt packages

sudo apt install gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev rustc cargo doxygen -y
sudo apt install -y git-lfs clinfo libgtest-dev
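
Since missing tools tend to show up later as command not found errors, a quick pre-flight check can save time. The sketch below looks up a few executables that the packages above should provide; the list is an assumption drawn from those package names, not an official requirement list.

```python
import shutil

# Executables expected from the apt packages installed above (assumed list).
REQUIRED_TOOLS = ["gcc", "make", "cmake", "git-lfs", "clinfo", "rustc", "cargo", "doxygen"]


def find_tools(names):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools(REQUIRED_TOOLS).items():
        print(f"{name:10s} {path or 'MISSING'}")
```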

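Verify the OpenCL driver

Run the clinfo command. Output like the following indicates that the OpenCL driver is working.

The OpenCL driver is rock solid. Kudos to the system software colleagues 👍!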

$ clinfo
Number of platforms                               1
  Platform Name                                   Vivante OpenCL Platform
  Platform Vendor                                 Vivante Corporation
  Platform Version                                OpenCL 3.0 V6.4.14.9.674707
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_il_program cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_gl_sharing cl_khr_command_buffer 
  Platform Extensions with Version                cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_khr_il_program                                                0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_khr_gl_sharing                                                0x400000 (1.0.0)
                                                  cl_khr_command_buffer                                            0x400000 (1.0.0)
  Platform Numeric Version                        0xc00000 (3.0.0)
  Platform Host timer resolution                  0ns

  Platform Name                                   Vivante OpenCL Platform
Number of devices                                 1
  Device Name                                     Vivante OpenCL Device GC8000L.6214.0000
  Device Vendor                                   Vivante Corporation
  Device Vendor ID                                0x564956
  Device Version                                  OpenCL 3.0 
  Device Numeric Version                          0xc00000 (3.0.0)
  Driver Version                                  OpenCL 3.0 V6.4.14.9.674707
  Device OpenCL C Version                         OpenCL C 1.2 
  Device OpenCL C all versions                    OpenCL C                                                         0x400000 (1.0.0)
                                                  OpenCL C                                                         0x401000 (1.1.0)
                                                  OpenCL C                                                         0x402000 (1.2.0)
                                                  OpenCL C                                                         0xc00000 (3.0.0)
  Device OpenCL C features                        __opencl_c_images                                                0x400000 (1.0.0)
                                                  __opencl_c_int64                                                 0x400000 (1.0.0)
  Latest comfornace test passed                   v2021-03-25-00
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               1
  Max clock frequency                             996MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             1024
  Preferred work group size multiple (device)     16
  Preferred work group size multiple (kernel)     16
  Max sub-groups per work group                   0
  Preferred / native vector sizes                 
    char                                                 4 / 4       
    short                                                4 / 4       
    int                                                  4 / 4       
    long                                                 4 / 4       
    half                                                 4 / 4        (cl_khr_fp16)
    float                                                4 / 4       
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              268435456 (256MiB)
  Error Correction support                        Yes
  Max memory allocation                           134217728 (128MiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 No
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       2048 bits (256 bytes)
  Preferred alignment for atomics                 
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Atomic memory capabilities                      relaxed, work-group scope
  Atomic fence capabilities                       relaxed, acquire/release, work-group scope
  Max size for global variable                    0
  Preferred total size of global vars             0
  Global Memory cache type                        Read/Write
  Global Memory cache size                        8192 (8KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 8192 images
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             8192x8192x8192 pixels
    Max number of read image args                 128
    Max number of write image args                8
    Max number of read/write image args           0
  Pipe support                                    No
  Max number of pipe args                         0
  Max active pipe reservations                    0
  Max pipe packet size                            0
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Generic address space support                   No
  Max size of kernel argument                     1024
  Queue properties (on host)                      
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Device enqueue capabilities                     (n/a)
  Queue properties (on device)                    
    Out-of-order execution                        No
    Profiling                                     No
    Preferred size                                0
    Max size                                      0
  Max queues on device                            0
  Max events on device                            0
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Non-uniform work-groups                       No
    Work-group collective functions               No
    Sub-group independent forward progress        No
    IL version                                    SPIR-V_1.5 
    ILs with version                              SPIR-V                                                           0x405000 (1.5.0)
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Built-in kernels with version                   (n/a)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_il_program cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_gl_sharing cl_khr_command_buffer 
  Device Extensions with Version                  cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_khr_il_program                                                0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_khr_gl_sharing                                                0x400000 (1.0.0)
                                                  cl_khr_command_buffer                                            0x400000 (1.0.0)

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [P0]
  clCreateContext(NULL, ...) [default]            Success [P0]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Vivante OpenCL Platform
    Device Name                                   Vivante OpenCL Device GC8000L.6214.0000
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Vivante OpenCL Platform
    Device Name                                   Vivante OpenCL Device GC8000L.6214.0000
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Vivante OpenCL Platform
    Device Name                                   Vivante OpenCL Device GC8000L.6214.0000

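This output describes the detailed specifications of an OpenCL platform named Vivante OpenCL Platform and its device. Nothing in it points to a technical problem or error, but a few items are worth noting or investigating further:

Half-precision Floating-point support and Single-precision Floating-point support both report that the device does not support denormals, which may affect applications that are sensitive to numerical precision.
Double-precision Floating-point support is marked (n/a), which means the device does not support double-precision floating point. This is a limitation for applications that need high-precision computation.
Max compute units is only 1, so the GPU's parallel processing capacity is limited, especially for complex graphics or compute-intensive workloads.
Sub-group independent forward progress is No, and Max sub-groups per work group is 0, so applications that depend on sub-group features need a different approach.
Profiling under Queue properties (on device) is No, but that entry refers to device-side queues. Profiling under Queue properties (on host) is Yes, so kernels launched from the host can still be timed through event profiling.
Queue properties (on device) reports a Preferred size and Max size of 0. Together with Device enqueue capabilities (n/a) and Max queues on device 0, this indicates that device-side enqueue is not supported, not that the queue size is unlimited.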
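
Fields such as the memory limits can be pulled out of the clinfo text programmatically, which is handy when comparing them against model sizes. This is a minimal sketch; `clinfo_int` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re


def clinfo_int(text: str, label: str):
    """Return the first integer following `label` at the start of a clinfo line, or None."""
    pattern = re.compile(r"^[ \t]*" + re.escape(label) + r"[ \t]+(\d+)", re.MULTILINE)
    match = pattern.search(text)
    return int(match.group(1)) if match else None


SAMPLE = """\
  Max compute units                               1
  Global memory size                              268435456 (256MiB)
  Max memory allocation                           134217728 (128MiB)
  Local memory size                               32768 (32KiB)
"""

global_mem = clinfo_int(SAMPLE, "Global memory size")
max_alloc = clinfo_int(SAMPLE, "Max memory allocation")

print(f"global memory : {global_mem // 2**20} MiB")
print(f"max allocation: {max_alloc // 2**20} MiB")
```

To run it against live output, the text could be captured with `subprocess.run(["clinfo"], capture_output=True, text=True).stdout` and passed in place of `SAMPLE`.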

Install Rust

See the Aliyun mirror: https://developer.aliyun.com/mirror/rustup

# Official Rust installer
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# or use the Aliyun installer script
curl --proto '=https' --tlsv1.2 -sSf https://mirrors.aliyun.com/repo/rust/rustup-init.sh | sh

Run rustc --version. Output like the following indicates that the installation succeeded.

$ source ~/.bashrc
$ rustc --version
rustc 1.81.0 (eeb90cda1 2024-09-04)

To switch Rust to the Aliyun mirror, add the following to .bashrc

export RUSTUP_UPDATE_ROOT=https://mirrors.aliyun.com/rustup/rustup
export RUSTUP_DIST_SERVER=https://mirrors.aliyun.com/rustup

Build CMake from source

Get a recent version of CMake (>= 3.24 is needed; the board ships 3.22 and it cannot be upgraded through apt)

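# Prebuilt aarch64 installer script (an alternative to building from source)
wget https://github.com/Kitware/CMake/releases/download/v3.30.5/cmake-3.30.5-linux-aarch64.sh
# Source tree, which the build steps that follow use
git clone https://github.com/Kitware/CMake.git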

Build and install

cd CMake
./bootstrap
make
sudo make install

Use the cmake --version command to verify the CMake version.
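
The version requirement can also be checked mechanically. The sketch below parses the first line printed by `cmake --version` and compares it with the 3.24 minimum mentioned earlier; the sample strings are illustrative, not captured from the board.

```python
import re


def parse_version(text: str):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if not match:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(text: str, minimum=(3, 24)) -> bool:
    """True when the version found in text is at least `minimum`."""
    return parse_version(text) >= minimum


print(meets_minimum("cmake version 3.22.1"))  # False: older than 3.24
print(meets_minimum("cmake version 3.30.5"))  # True
```

Comparing tuples of integers avoids the classic string-comparison trap where "3.9" would sort after "3.24".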

Build the TVM Unity Compiler from source

Download LLVM

# 18.1.8
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-18.1.8/clang+llvm-18.1.8-aarch64-linux-gnu.tar.xz
tar -xvf clang+llvm-18.1.8-aarch64-linux-gnu.tar.xz

Clone the tvm repository

git clone --recursive https://github.com/mlc-ai/relax.git tvm_unity

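Here, git clone --recursive clones a repository that contains submodules and recursively clones every submodule along with it.

If a poor network connection interrupts the clone, you can simply run git clone again. If it then reports that the destination directory already exists and is not empty, enter the repository directory and initialize and update the submodules manually with the following command: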

git submodule update --init --recursive

Build

cd tvm_unity/
mkdir -p build && cd build
cp ../cmake/config.cmake .

Use vim to change the following entries in config.cmake:

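set(CMAKE_BUILD_TYPE RelWithDebInfo) # not present in the file; add it
set(USE_OPENCL ON) # already present in the file; change its value
set(HIDE_PRIVATE_SYMBOLS ON) # not present in the file; add it
set(USE_LLVM /media/rootfs/gpu_llm_sd/clang+llvm-17.0.2-aarch64-linux-gnu/bin/llvm-config) # path to llvm-config; adjust it to the location and version of the LLVM you unpacked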
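
The same edits can be scripted instead of made by hand in vim. The sketch below rewrites `set(KEY value)` lines that already exist and appends those that do not; `apply_settings` is a hypothetical helper, the sample text stands in for the real config.cmake, and the llvm-config path is a placeholder.

```python
import re


def apply_settings(text: str, settings: dict) -> str:
    """Set each `set(KEY value)` line in a CMake config, appending keys that are absent."""
    for key, value in settings.items():
        pattern = re.compile(
            r"^[ \t]*set\(\s*" + re.escape(key) + r"\s+[^)]*\)", re.MULTILINE
        )
        line = f"set({key} {value})"
        if pattern.search(text):
            # A function replacement keeps backslashes in paths literal.
            text = pattern.sub(lambda _match: line, text)
        else:
            text = text.rstrip("\n") + "\n" + line + "\n"
    return text


SAMPLE = "set(USE_OPENCL OFF)\nset(USE_LLVM OFF)\n"  # stand-in for config.cmake
SETTINGS = {
    "CMAKE_BUILD_TYPE": "RelWithDebInfo",
    "USE_OPENCL": "ON",
    "HIDE_PRIVATE_SYMBOLS": "ON",
    "USE_LLVM": "/path/to/llvm-config",  # placeholder, not a real path
}

print(apply_settings(SAMPLE, SETTINGS))
```

To apply it for real, read config.cmake in the build directory, pass its contents through `apply_settings`, and write the result back.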

Start the build. As it approaches 100%, memory becomes extremely tight, so wait patiently at that point.

cmake ..
make -j6 # Why not -j8? To keep two cores free for pulling the code for the next step, haha

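Install tvm. The installation builds a wheel and is very slow, so please be patient. If you interrupt it with Ctrl + C, you may need to rebuild; otherwise Python will keep reporting errors.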

cd ../python
pip3 install --user .

Add the environment variables to .bashrc, then activate them with source ~/.bashrc

export PATH="$PATH:/root/.local/bin"
export PYTHONPATH=/media/rootfs/gpu_llm/tvm_unity/python:$PYTHONPATH

After installation, run the tvmc command. Output like the following indicates that the installation succeeded.

$ tvmc
usage: tvmc [--config CONFIG] [-v] [--version] [-h] {run,tune,compile} ...

TVM compiler driver

options:
  --config CONFIG     configuration json file
  -v, --verbose       increase verbosity
  --version           print the version and exit
  -h, --help          show this help message and exit.

commands:
  {run,tune,compile}
    run               run a compiled module
    tune              auto-tune a model
    compile           compile a model.

TVMC - TVM driver command-line interface

In the corresponding Python 3 interpreter, you can also confirm that the OpenCL device exists with the following commands.

>>> import tvm
>>> tvm.opencl().exist
True

Build MLC-LLM from source

Clone the project

git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
# git submodule update --init --recursive # if the clone was interrupted

Build from source

# create build directory
mkdir -p build && cd build
# generate build configuration
python ../cmake/gen_cmake_config.py
# build mlc_llm libraries
cmake .. && cmake --build . --parallel $(nproc) && cd ..

Configure environment variables

export MLC_LLM_SOURCE_DIR=/media/rootfs/gpu_llm/mlc-llm
export PYTHONPATH=$MLC_LLM_SOURCE_DIR/python:$PYTHONPATH
alias mlc_llm="python -m mlc_llm"

Python packages that may be missing

pip install pydantic shortuuid fastapi requests tqdm prompt-toolkit safetensors torch

Run mlc_llm chat -h. If the following output appears, the setup succeeded.

$ mlc_llm chat -h
usage: MLC LLM Chat CLI [-h] [--device DEVICE] [--model-lib MODEL_LIB] [--overrides OVERRIDES] model

positional arguments:
  model                 A path to ``mlc-chat-config.json``, or an MLC model directory that contains
                        `mlc-chat-config.json`. It can also be a link to a HF repository pointing to an
                        MLC compiled model. (required)

options:
  -h, --help            show this help message and exit
  --device DEVICE       The device used to deploy the model such as "cuda" or "cuda:0". Will detect
                        from local available GPUs if not specified. (default: "auto")
  --model-lib MODEL_LIB
                        The full path to the model library file to use (e.g. a ``.so`` file). If
                        unspecified, we will use the provided ``model`` to search over possible paths.
                        It the model lib is not found, it will be compiled in a JIT manner. (default:
                        "None")
  --overrides OVERRIDES
                        Model configuration override. Supports overriding, `context_window_size`,
                        `prefill_chunk_size`, `sliding_window_size`, `attention_sink_size`,
                        `max_num_sequence` and `tensor_parallel_shards`. The overrides could be
                        explicitly specified via details knobs, e.g. --overrides
                        "context_window_size=1024;prefill_chunk_size=128". (default: "")

Ex1. Small language model: SmolLM

Download the model that has already been int4-quantized

Compile the model

# 135M
mlc_llm compile dist/SmolLM-135M-Instruct-q4f32_1-MLC/mlc-chat-config.json \
    --device opencl \
    --output libs/SmolLM-135M-Instruct-q4f32_1-MLC.so

# 360M
mlc_llm compile dist/SmolLM-360M-Instruct-q4f32_1-MLC/mlc-chat-config.json \
    --device opencl \
    --output libs/SmolLM-360M-Instruct-q4f32_1-MLC.so

Run the chat

# 135M
mlc_llm chat dist/SmolLM-135M-Instruct-q4f32_1-MLC \
    --device opencl \
    --model-lib libs/SmolLM-135M-Instruct-q4f32_1-MLC.so
    
# 360M
mlc_llm chat dist/SmolLM-360M-Instruct-q4f32_1-MLC \
    --device opencl \
    --model-lib libs/SmolLM-360M-Instruct-q4f32_1-MLC.so
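
The compile and chat commands for both model sizes follow one naming pattern, so they can be generated rather than typed. The sketch below builds the argument lists from a model name, using only the flags and the dist/ and libs/ layout shown above; the helper names are hypothetical.

```python
import shlex


def compile_cmd(model: str, device: str = "opencl") -> list:
    """Arguments for `mlc_llm compile`, following the dist/ and libs/ layout above."""
    return [
        "mlc_llm", "compile", f"dist/{model}/mlc-chat-config.json",
        "--device", device,
        "--output", f"libs/{model}.so",
    ]


def chat_cmd(model: str, device: str = "opencl") -> list:
    """Arguments for `mlc_llm chat` using the library produced by compile_cmd."""
    return [
        "mlc_llm", "chat", f"dist/{model}",
        "--device", device,
        "--model-lib", f"libs/{model}.so",
    ]


for size in ("135M", "360M"):
    name = f"SmolLM-{size}-Instruct-q4f32_1-MLC"
    print(shlex.join(compile_cmd(name)))
    print(shlex.join(chat_cmd(name)))
```

Shell aliases are not visible to `subprocess`, so when launching these from Python the leading `"mlc_llm"` would need to be replaced with `sys.executable, "-m", "mlc_llm"`.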

Ex2. Tongyi Qianwen: Qwen2.5-0.5B

Download the model that has already been int4-quantized

git lfs install
git clone https://huggingface.co/mlc-ai/Qwen2.5-0.5B-Instruct-q4f32_1-MLC dist/Qwen2.5-0.5B-Instruct-q4f32_1-MLC

Compile the model

mlc_llm compile dist/Qwen2.5-0.5B-Instruct-q4f32_1-MLC/mlc-chat-config.json \
    --quantization q4f32_1 \
    --model-type qwen2 \
    --device opencl \
    --output libs/Qwen2.5-0.5B-Instruct-q4f32_1-MLC.so

Run the chat

mlc_llm chat dist/Qwen2.5-0.5B-Instruct-q4f32_1-MLC/ \
    --device opencl \
    --model-lib libs/Qwen2.5-0.5B-Instruct-q4f32_1-MLC.so

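Use the srpi-config command to adjust the ION memory

GPUs on embedded devices usually have no dedicated video memory and share system memory with other IP blocks, so this memory region needs to be enlarged. Note that the X5 allocates all of this memory up front, so the available memory reported by Ubuntu becomes smaller.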

Performance monitoring commands

CPU and memory usage (to show that CPU usage is very low)

htop

GPU usage

cat /sys/kernel/debug/gc/load
# watch -n 2 cat /sys/kernel/debug/gc/load

BPU usage (to show that the BPU takes no part in the computation)

hrut_somstatus
# watch -n 2 hrut_somstatus

Storage usage

df -h ~ 
# watch -n 2 df -h ~
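
When logging from a script instead of watching a terminal, the storage figures can be read with the standard library. This is a minimal sketch; `disk_usage_gib` is a hypothetical helper, and its numbers can differ slightly from `df -h` because of rounding and reserved blocks.

```python
import os
import shutil


def disk_usage_gib(path: str = "~") -> dict:
    """Total, used and free space in GiB for the filesystem containing path."""
    usage = shutil.disk_usage(os.path.expanduser(path))
    gib = 2**30
    return {
        "total": usage.total / gib,
        "used": usage.used / gib,
        "free": usage.free / gib,
    }


if __name__ == "__main__":
    stats = disk_usage_gib("~")
    print(f"total {stats['total']:.1f} GiB, used {stats['used']:.1f} GiB, free {stats['free']:.1f} GiB")
```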

Test prompts

Please introduce yourself.

Helium walks into a bar. The bartender says "we don't serve noble gases in here". Helium doesn't react. This joke is funny because what?

Find one of the following options that is different from the others:(1) water(2) the sun (3)gasoline (4) the wind (5) cement

Find one of the following numbers in particular: (1)1 (2)2 (3)5 (4)7 (5)11 (6)13 (7)15

Tell a story about love