Installing llama.cpp on an ARM64 Kylin machine

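For some reason the official releases do not include a plain ARM64 Linux binary, even though plenty of fairly obscure platforms are covered: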

cuda : display total and free VRAM capacity during device initialization (#20185)

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Intel (x64)
iOS XCFramework
Linux:

Ubuntu x64 (CPU)
Ubuntu x64 (Vulkan)
Ubuntu x64 (ROCm 7.2)
Ubuntu s390x (CPU)
Windows:

Windows x64 (CPU)
Windows arm64 (CPU)
Windows x64 (CUDA 12) - CUDA 12.4 DLLs
Windows x64 (CUDA 13) - CUDA 13.1 DLLs
Windows x64 (Vulkan)
Windows x64 (SYCL)
Windows x64 (HIP)
openEuler:

openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

1. Installing the Docker image (failed)

Following the steps in the official documentation.

Pull the image:

kylin@kylin-aaa:~$ sudo docker pull ghcr.io/ggml-org/llama.cpp:light
[sudo] kylin 的密码：
light: Pulling from ggml-org/llama.cpp
b1cba2e842ca: Pull complete 
a9a77c80136a: Pull complete 
658538cc0060: Pull complete 
972d178e9b25: Pull complete 
4f4fb700ef54: Pull complete 
Digest: sha256:e8edb700eaa1919763a28fa0f81800d4398395cd0a3ba09f5e8072067011cc29
Status: Downloaded newer image for ghcr.io/ggml-org/llama.cpp:light

Download the model:

kylin@kylin-aaa:/shujv/par/models$ curl -LO https://hf-mirror.com/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-Q4_K_M.gguf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1379    0  1379    0     0   1971      0 --:--:-- --:--:-- --:--:--  1970
100  507M  100  507M    0     0  13.1M      0  0:00:38  0:00:38 --:--:-- 14.8M
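
The first line of the curl output is a small redirect response, so it is worth confirming that the file on disk really is a model and not an error page. GGUF files start with the four ASCII bytes "GGUF". A minimal sketch, where is_gguf is a hypothetical helper name and the path is the one used in this post:

```shell
# Hypothetical helper: succeeds when the file starts with the GGUF magic bytes.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Path taken from the download step above; skipped if the file is absent.
model=/shujv/par/models/Qwen3.5-0.8B-Q4_K_M.gguf
if [ -f "$model" ]; then
  if is_gguf "$model"; then
    echo "$model looks like a GGUF file"
  else
    echo "$model is not a GGUF file (maybe an HTML error page?)"
  fi
fi
```

This only checks the header, not the integrity of the whole download.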

Run the container. Whether it is started in the foreground or in the background, it fails with the same "exec format error":

kylin@kylin-aaa:/shujv/par/models$ sudo docker run -v /shujv/par/models:/models --entrypoint /app/llama-cli ghcr.io/ggml-org/llama.cpp:light -m /models/Qwen3.5-0.8B-Q4_K_M.gguf
standard_init_linux.go:211: exec user process caused "exec format error"

kylin@kylin-aaa:/shujv/par/models$ sudo docker run -d -v /shujv/par/models:/models --name llama ghcr.io/ggml-org/llama.cpp:light 
65b6288401672247130a06069cace35be17c2699f7df08cc6608f2639047ab95

kylin@kylin-aaa:/shujv/par/models$ sudo docker exec -it llama bash
Error response from daemon: Container 65b6288401672247130a06069cace35be17c2699f7df08cc6608f2639047ab95 is not running

kylin@kylin-aaa:/shujv/par/models$ sudo docker logs llama
standard_init_linux.go:211: exec user process caused "exec format error"
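
An "exec format error" usually means the binary inside the image was built for a different CPU architecture than the host. One way to check, sketched here under the assumption that this is the cause, is to compare the host architecture with the one recorded in the image. docker_arch is a hypothetical helper that maps `uname -m` names to Docker's naming:

```shell
# Hypothetical helper: map `uname -m` output to the architecture names Docker uses.
docker_arch() {
  case "$1" in
    x86_64)        echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *)             echo "$1" ;;
  esac
}

host_arch=$(docker_arch "$(uname -m)")
echo "host architecture:  $host_arch"

# Compare with the pulled image when docker and the image are available.
image=ghcr.io/ggml-org/llama.cpp:light
if command -v docker >/dev/null 2>&1; then
  image_arch=$(docker image inspect --format '{{.Architecture}}' "$image" 2>/dev/null) || image_arch=unknown
  echo "image architecture: $image_arch"
fi
```

If the two values differ (for example amd64 in the image and arm64 on the host), the image cannot run natively on this machine.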

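2. Building and installing from source

Following the steps in the official documentation.

Fetch the source

The host has no git, so I cloned the repository from inside an existing gcc container, where the host directory /shujv/par is visible as /par: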

kylin@kylin-aaa:/shujv/par$ git clone https://github.com/ggml-org/llama.cpp
程序"git"尚未安装。 您可以使用以下命令安装：
sudo apt install git
您必须启用main 组件
kylin@kylin-aaa:/shujv/par$ sudo docker start gcc
gcc
kylin@kylin-aaa:/shujv/par$ sudo docker exec -it gcc bash

root@66d4e20ec1d7:/par# git clone https://github.com/ggml-org/llama.cpp
Cloning into 'llama.cpp'...
remote: Enumerating objects: 82585, done.
remote: Counting objects: 100% (172/172), done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 82585 (delta 88), reused 24 (delta 24), pack-reused 82413 (from 3)
Receiving objects: 100% (82585/82585), 311.79 MiB | 1.32 MiB/s, done.
Resolving deltas: 100% (59368/59368), done.

However, the gcc container has no cmake, so I went back to the host to build:

root@66d4e20ec1d7:/par# cd llama.cpp
root@66d4e20ec1d7:/par/llama.cpp# cmake -B build
bash: cmake: command not found
root@66d4e20ec1d7:/par/llama.cpp# exit
exit
kylin@kylin-aaa:/shujv/par$ cd llama.cpp
kylin@kylin-aaa:/shujv/par/llama.cpp$ cmake -B build
CMake Error: The source directory "/shujv/par/llama.cpp/build" does not exist.
Specify --help for usage, or press the help button on the CMake GUI.


kylin@kylin-aaa:/shujv/par$ sudo chmod 777 llama.cpp -R


kylin@kylin-aaa:/shujv/par/llama.cpp$ cmake -B .
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
  CMake 3.14...3.28 or higher is required.  You are running version 3.5.1
-- Configuring incomplete, errors occurred!

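In the end this failed because the host's cmake is too old: the project requires at least 3.14 and the host has 3.5.1. The first error above has the same root cause, since the "-B <dir>" form is only documented from CMake 3.13 on, and 3.5.1 evidently took "build" as the source directory.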
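
A quick pre-flight check can catch this before configuring. The sketch below compares the installed cmake against the 3.14 minimum quoted in the error message. version_ge is a hypothetical helper and relies on GNU sort's -V option:

```shell
# Hypothetical helper: succeeds when version $1 >= version $2 (uses GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required=3.14   # minimum taken from the CMake error message
if command -v cmake >/dev/null 2>&1; then
  # First line of `cmake --version` looks like "cmake version 3.5.1".
  have=$(cmake --version | head -n 1 | awk '{print $3}')
  if version_ge "$have" "$required"; then
    echo "cmake $have is new enough"
  else
    echo "cmake $have is too old, need >= $required"
  fi
fi
```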

I downloaded a newer cmake build. The configure step now succeeds, but the host's old gcc (5.4.0) fails during compilation:

kylin@kylin-aaa:/shujv/par/llama.cpp$ /shujv/par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake -B .
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
...
-- Adding CPU backend variant ggml-cpu: -mcpu=native 
-- ggml version: 0.9.7
-- ggml commit:  unknown
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found OpenSSL: /usr/lib/aarch64-linux-gnu/libcrypto.so (found version "1.0.2g")
-- Performing Test OPENSSL_VERSION_SUPPORTED
-- Performing Test OPENSSL_VERSION_SUPPORTED - Failed
-- Generating embedded license file for target: common
-- Configuring done (2.9s)
-- Generating done (1.0s)
-- Build files have been written to: /shujv/par/llama.cpp
kylin@kylin-aaa:/shujv/par/llama.cpp$ 

kylin@kylin-aaa:/shujv/par/llama.cpp$ cmake --build build --config Release
Error: could not load cache
kylin@kylin-aaa:/shujv/par/llama.cpp$ cmake --build . --config Release
[  0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[  0%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
/shujv/par/llama.cpp/ggml/src/ggml.cpp:16:13: warning: 'ggml_uncaught_exception_init' defined but not used [-Wunused-variable]
 static bool ggml_uncaught_exception_init = []{
             ^
[  1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[  1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[  1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o
[  1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o
[  1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o
[  2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o
/shujv/par/llama.cpp/ggml/src/gguf.cpp: In member function 'const T& gguf_kv::get_val(size_t) const':
/shujv/par/llama.cpp/ggml/src/gguf.cpp:195:12: error: expected '(' before 'constexpr'
         if constexpr (std::is_same<T, std::string>::value) {
            ^
...

/shujv/par/llama.cpp/ggml/src/gguf.cpp: In member function 'bool gguf_reader::read(std::vector<T>&, size_t) const [with T = double; size_t = long unsigned int]':
/shujv/par/llama.cpp/ggml/src/gguf.cpp:304:5: warning: control reaches end of non-void function [-Wreturn-type]
     }
     ^
ggml/src/CMakeFiles/ggml-base.dir/build.make:176: recipe for target 'ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o' failed
make[2]: *** [ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o] Error 1
CMakeFiles/Makefile2:2272: recipe for target 'ggml/src/CMakeFiles/ggml-base.dir/all' failed
make[1]: *** [ggml/src/CMakeFiles/ggml-base.dir/all] Error 2
Makefile:145: recipe for target 'all' failed
make: *** [all] Error 2
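
The failing construct, `if constexpr`, is a C++17 feature that GCC supports from version 7 on, so GCC 5.4.0 cannot compile it. A small sketch of that check, with gcc_supports_if_constexpr as a hypothetical helper name:

```shell
# Hypothetical helper: succeeds when the given GCC version is 7 or newer,
# the first GCC series that implements C++17 `if constexpr`.
gcc_supports_if_constexpr() {
  major=${1%%.*}          # "5.4.0" -> "5", "14" -> "14"
  [ "$major" -ge 7 ]
}

# The two compilers that appear in this post.
for v in 5.4.0 14.2.0; do
  if gcc_supports_if_constexpr "$v"; then
    echo "gcc $v: ok"
  else
    echo "gcc $v: too old for C++17 'if constexpr'"
  fi
done

# Check the compiler on PATH, if there is one.
if command -v gcc >/dev/null 2>&1; then
  gcc_supports_if_constexpr "$(gcc -dumpversion)" || echo "installed gcc is too old"
fi
```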

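Back in the gcc container (GCC 14.2.0), I re-ran cmake. After a few false starts caused by the CMakeCache.txt left over from the host run, it worked: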

root@66d4e20ec1d7:/par/testcc# cd /par/llama.cpp
root@66d4e20ec1d7:/par/llama.cpp# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake -B .
CMake Error: The current CMakeCache.txt directory /par/llama.cpp/CMakeCache.txt is different than the directory /shujv/par/llama.cpp where CMakeCache.txt was created. This may result in binaries being created in the wrong place. If you are not sure, reedit the CMakeCache.txt
CMake Error: The source "/par/llama.cpp/CMakeLists.txt" does not match the source "/shujv/par/llama.cpp/CMakeLists.txt" used to generate cache.  Re-run cmake with a different source directory.


root@66d4e20ec1d7:/par/llama.cpp# cd build
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake -B .
CMake Error: The source directory "/par/llama.cpp/build" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake ..   
CMake Error: The current CMakeCache.txt directory /par/llama.cpp/CMakeCache.txt is different than the directory /shujv/par/llama.cpp where CMakeCache.txt was created. This may result in binaries being created in the wrong place. If you are not sure, reedit the CMakeCache.txt
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake ..
-- The C compiler identification is GNU 14.2.0
-- The CXX compiler identification is GNU 14.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done

...
-- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_MATMUL_INT8;-U__ARM_FEATURE_SVE;-U__ARM_FEATURE_SME;-mcpu=tsv110+crc+nordma+nofp16+dotprod+noi8mm+nosve+nosme 
-- ggml version: 0.9.7
-- ggml commit:  d417bc43d-dirty
-- Found OpenSSL: /usr/lib/aarch64-linux-gnu/libcrypto.so (found version "3.0.15")
-- Performing Test OPENSSL_VERSION_SUPPORTED
-- Performing Test OPENSSL_VERSION_SUPPORTED - Success
-- OpenSSL found: 3.0.15
-- Generating embedded license file for target: common
-- Configuring done (4.5s)
-- Generating done (0.4s)
-- Build files have been written to: /par/llama.cpp/build
root@66d4e20ec1d7:/par/llama.cpp/build# 
root@66d4e20ec1d7:/par/llama.cpp/build# ls
CMakeCache.txt	CTestTestfile.cmake    Makefile  bin		      common		     examples  license.cpp	   llama-version.cmake	pocs  tests  vendor
CMakeFiles	DartConfiguration.tcl  Testing	 cmake_install.cmake  compile_commands.json  ggml      llama-config.cmake  llama.pc		src   tools
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake --build build --config Release
Error: /par/llama.cpp/build/build is not a directory
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake --build . --config Release
[  0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[  0%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
[  0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[  1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
...
[ 99%] Linking CXX executable ../../bin/llama-fit-params
[ 99%] Built target llama-fit-params
[ 99%] Building CXX object tools/results/CMakeFiles/llama-results.dir/results.cpp.o
[100%] Linking CXX executable ../../bin/llama-results
[100%] Built target llama-results
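
The CMakeCache.txt errors at the start of this transcript come from the earlier in-source configure on the host, which recorded the host path /shujv/par/llama.cpp. The transcript does not show what changed between the failing and the succeeding "cmake .." run; presumably the stale cache was removed. A sketch of that cleanup, with clean_stale_cache as a hypothetical helper name:

```shell
# Hypothetical helper: delete the configure state that an in-source
# cmake run leaves in directory $1.
clean_stale_cache() {
  rm -rf "$1/CMakeCache.txt" "$1/CMakeFiles"
}

# Usage with the container paths from this post, followed by a clean
# out-of-source configure and build:
#   clean_stale_cache /par/llama.cpp
#   cmake -S /par/llama.cpp -B /par/llama.cpp/build
#   cmake --build /par/llama.cpp/build --config Release
```

Keeping all generated files under build/ avoids the path mismatch when the same source tree is mounted at different locations on the host and in the container.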

The resulting llama-cli loads the model downloaded earlier and accepts chat input:

root@66d4e20ec1d7:/par/llama.cpp/build/bin# ./llama-cli -m /par/models/Qwen3.5-0.8B-Q4_K_M.gguf --ctx-size 16384 -cnv

Loading model...  


▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀

build      : b8245-d417bc43d
model      : Qwen3.5-0.8B-Q4_K_M.gguf
modalities : text

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file


>

Next, start the server with llama-server:

root@66d4e20ec1d7:/par/llama.cpp/build/bin# ./llama-server -m /par/models/Qwen3.5-0.8B-Q4_K_M.gguf --ctx-size 16384 --port 8080 
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8245 (d417bc43d) with GNU 14.2.0 for Linux aarch64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

Running without SSL
init: using 7 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/par/models/Qwen3.5-0.8B-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.91 seconds
llama_model_loader: loaded meta data with 46 key-value pairs and 320 tensors from /par/models/Qwen3.5-0.8B-Q4_K_M.gguf (version GGUF V3 (latest))
...

llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     3.79 MiB
llama_kv_cache:        CPU KV buffer size =   192.00 MiB
llama_kv_cache: size =  192.00 MiB ( 16384 cells,   6 layers,  4/1 seqs), K (f16):   96.00 MiB, V (f16):   96.00 MiB
llama_memory_recurrent:        CPU RS buffer size =    77.06 MiB
llama_memory_recurrent: size =   77.06 MiB (     4 cells,  24 layers,  4 seqs), R (f32):    5.06 MiB, S (f32):   72.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:        CPU compute buffer size =   489.00 MiB
sched_reserve: graph nodes  = 3123 (with bs=512), 1377 (with bs=1)
sched_reserve: graph splits = 1
sched_reserve: reserve took 11.13 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv    load_model: initializing slots, n_slots = 4
common_speculative_is_compat: the target context does not support partial sequence removal
srv    load_model: speculative decoding not supported by this context
slot   load_model: id  0 | task -1 | new slot, n_ctx = 16384
slot   load_model: id  1 | task -1 | new slot, n_ctx = 16384
slot   load_model: id  2 | task -1 | new slot, n_ctx = 16384
slot   load_model: id  3 | task -1 | new slot, n_ctx = 16384
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle
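
Loading the model takes a moment, so a script that starts the server and then queries it should wait until it is ready. llama-server exposes a /health endpoint for this. Below is a minimal polling sketch. retry is a hypothetical helper, and the 30 attempts are an arbitrary choice:

```shell
# Hypothetical helper: run a command up to $1 times, one second apart,
# and succeed as soon as the command succeeds.
retry() {
  attempts=$1; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage (from the same container, since the server binds to 127.0.0.1):
#   retry 30 curl -sf http://localhost:8080/health && echo "server is ready"
```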

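The server can now be queried with curl, both to list the model information and to ask a question, although the answer it gives is off-topic: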

root@66d4e20ec1d7:/par# curl http://localhost:8080/v1/models

{"models":[{"name":"Qwen3.5-0.8B-Q4_K_M.gguf","model":"Qwen3.5-0.8B-Q4_K_M.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"Qwen3.5-0.8B-Q4_K_M.gguf","aliases":[],"tags":[],"object":"model","created":1773021697,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":1024,"n_params":752393024,"siz


root@66d4e20ec1d7:/par# curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen3.5-0.8B-Q4_K_M","prompt": "汉译英：人工智能的未来发展","max_tokens": 100, "temperature": 0.8}'
{"choices":[{"text":"\n\n<think>\n\n</think>\n\n人工智能的未来发展\n\n**1. 智能化与自主性**\n随着机器学习算法的不断迭代，人工智能在**自主性**上实现质的飞跃。从传统的被动执行指令，发展到能够**自主规划、自主决策**，甚至能够自主发现并优化自身的效率。这种能力的提升，标志着人工智能从"模仿人类"转向"模拟人类"，是迈向深度智能的重要里程碑。\n\n**2. 复杂性与环境适应性**\n面对日益复杂的**自然","index":0,"logprobs":null,"finish_reason":"length"}],"created":1773022021,"model":"Qwen3.5-0.8B-Q4_K_M.gguf","system_fingerprint":"b8245-d417bc43d","object":"text_completion","usage":{"completion_tokens":100,"prompt_tokens":7,"total_tokens":107},"id":"chatcmpl-478Re5iVd6NvrWPvgRpTgm5pPVrNSN5U","timings":{"cache_n":0,"prompt_n":7,"prompt_ms":218.4,"prompt_per_token_ms":31.2,"prompt_per_second":32.05128205128205,"predicted_n":100,"predicted_ms":7099.046,"predicted_per_token_ms":70.99046,"predicted_per_second":14.08639977822372}}
root@66d4e20ec1d7:/par#
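
One likely reason for the off-topic answer is that /v1/completions hands the prompt to the model as raw text to continue, without the chat template shown in the server log. llama-server also provides an OpenAI-compatible /v1/chat/completions endpoint that applies the template. A minimal sketch follows. chat_payload and ask_chat are hypothetical helper names, and no JSON escaping is done, so the prompt must not contain double quotes or backslashes:

```shell
# Hypothetical helper: wrap a prompt in an OpenAI-style chat request body.
# $1 = prompt (no double quotes or backslashes), $2 = max_tokens (default 100).
chat_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%d,"temperature":0.8}' \
    "$1" "${2:-100}"
}

# Hypothetical helper: send the prompt to a llama-server on localhost:8080.
ask_chat() {
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1" "${2:-100}")"
}

# Show the request body for the prompt used above.
chat_payload "汉译英：人工智能的未来发展" 100
echo

# Usage, with the server from the previous step running:
#   ask_chat "汉译英：人工智能的未来发展" 100
```

Whether the 0.8B model then translates correctly is not something this post has tested.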