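For some reason the official llama.cpp releases do not include an ARM64 Linux binary, even though many far more obscure platforms are covered.
Release details: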
cuda : display total and free VRAM capacity during device initialization (#20185)
macOS/iOS:
macOS Apple Silicon (arm64)
macOS Intel (x64)
iOS XCFramework
Linux:
Ubuntu x64 (CPU)
Ubuntu x64 (Vulkan)
Ubuntu x64 (ROCm 7.2)
Ubuntu s390x (CPU)
Windows:
Windows x64 (CPU)
Windows arm64 (CPU)
Windows x64 (CUDA 12) - CUDA 12.4 DLLs
Windows x64 (CUDA 13) - CUDA 13.1 DLLs
Windows x64 (Vulkan)
Windows x64 (SYCL)
Windows x64 (HIP)
openEuler:
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)
1. Installing the Docker image (failed)
Following the steps in the official documentation:
Pull the image
kylin@kylin-aaa:~$ sudo docker pull ghcr.io/ggml-org/llama.cpp:light
[sudo] kylin 的密码:
light: Pulling from ggml-org/llama.cpp
b1cba2e842ca: Pull complete
a9a77c80136a: Pull complete
658538cc0060: Pull complete
972d178e9b25: Pull complete
4f4fb700ef54: Pull complete
Digest: sha256:e8edb700eaa1919763a28fa0f81800d4398395cd0a3ba09f5e8072067011cc29
Status: Downloaded newer image for ghcr.io/ggml-org/llama.cpp:light
Download the model
kylin@kylin-aaa:/shujv/par/models$ curl -LO https://hf-mirror.com/unsloth/Qwen3.5-0.8B-GGUF/resolve/main/Qwen3.5-0.8B-Q4_K_M.gguf
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1379 0 1379 0 0 1971 0 --:--:-- --:--:-- --:--:-- 1970
100 507M 100 507M 0 0 13.1M 0 0:00:38 0:00:38 --:--:-- 14.8M
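Since the download goes through a mirror and a redirect, a quick sanity check on the file can save a confusing failure later. A minimal sketch, relying on the fact that GGUF files begin with the four ASCII bytes "GGUF"; the function name is_gguf is made up for illustration:

```shell
# Succeed when the file starts with the 4-byte GGUF magic.
# is_gguf is a hypothetical helper name, not part of llama.cpp.
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Usage (path from the transcript above, not run here):
#   is_gguf /shujv/par/models/Qwen3.5-0.8B-Q4_K_M.gguf && echo "looks like GGUF"
```

This only checks the header, so it catches an HTML error page saved under the model name but not a truncated download.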
Run the container. Whether it runs in the foreground or in the background (detached), the result is the same "exec format error".
kylin@kylin-aaa:/shujv/par/models$ sudo docker run -v /shujv/par/models:/models --entrypoint /app/llama-cli ghcr.io/ggml-org/llama.cpp:light -m /models/Qwen3.5-0.8B-Q4_K_M.gguf
standard_init_linux.go:211: exec user process caused "exec format error"
kylin@kylin-aaa:/shujv/par/models$ sudo docker run -d -v /shujv/par/models:/models --name llama ghcr.io/ggml-org/llama.cpp:light
65b6288401672247130a06069cace35be17c2699f7df08cc6608f2639047ab95
kylin@kylin-aaa:/shujv/par/models$ sudo docker exec -it llama bash
Error response from daemon: Container 65b6288401672247130a06069cace35be17c2699f7df08cc6608f2639047ab95 is not running
kylin@kylin-aaa:/shujv/par/models$ sudo docker logs llama
standard_init_linux.go:211: exec user process caused "exec format error"
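An "exec format error" generally means the kernel could not execute the container's entrypoint binary, most often because it was built for a different CPU architecture than the host. The sketch below maps the output of uname -m to the platform string Docker uses, so it can be compared with what the pulled image reports. The function name docker_platform_for is hypothetical, and only a few common machine names are covered:

```shell
# Map a `uname -m` machine name to a Docker platform string.
# docker_platform_for is a hypothetical helper; extend the cases as needed.
docker_platform_for() {
    case "$1" in
        x86_64|amd64)  echo "linux/amd64" ;;
        aarch64|arm64) echo "linux/arm64" ;;
        s390x)         echo "linux/s390x" ;;
        *)             echo "unknown"; return 1 ;;
    esac
}

# Usage (requires Docker, not run here):
#   docker_platform_for "$(uname -m)"
#   sudo docker image inspect --format '{{.Os}}/{{.Architecture}}' ghcr.io/ggml-org/llama.cpp:light
```

If the two strings differ, the image was pulled for another architecture, which would be consistent with the error above.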
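2. Building and installing from source
Following the steps in the official documentation:
Fetch the source code
The host has no git, so I cloned the repository inside the gcc container.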
kylin@kylin-aaa:/shujv/par$ git clone https://github.com/ggml-org/llama.cpp
程序"git"尚未安装。 您可以使用以下命令安装:
sudo apt install git
您必须启用main 组件
kylin@kylin-aaa:/shujv/par$ sudo docker start gcc
gcc
kylin@kylin-aaa:/shujv/par$ sudo docker exec -it gcc bash
root@66d4e20ec1d7:/par# git clone https://github.com/ggml-org/llama.cpp
Cloning into 'llama.cpp'...
remote: Enumerating objects: 82585, done.
remote: Counting objects: 100% (172/172), done.
remote: Compressing objects: 100% (148/148), done.
remote: Total 82585 (delta 88), reused 24 (delta 24), pack-reused 82413 (from 3)
Receiving objects: 100% (82585/82585), 311.79 MiB | 1.32 MiB/s, done.
Resolving deltas: 100% (59368/59368), done.
However, the gcc container has no cmake, so I went back to the host to build.
root@66d4e20ec1d7:/par# cd llama.cpp
root@66d4e20ec1d7:/par/llama.cpp# cmake -B build
bash: cmake: command not found
root@66d4e20ec1d7:/par/llama.cpp# exit
exit
kylin@kylin-aaa:/shujv/par$ cd llama.cpp
kylin@kylin-aaa:/shujv/par/llama.cpp$ cmake -B build
CMake Error: The source directory "/shujv/par/llama.cpp/build" does not exist.
Specify --help for usage, or press the help button on the CMake GUI.
kylin@kylin-aaa:/shujv/par$ sudo chmod 777 llama.cpp -R
kylin@kylin-aaa:/shujv/par/llama.cpp$ cmake -B .
CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
CMake 3.14...3.28 or higher is required. You are running version 3.5.1
-- Configuring incomplete, errors occurred!
In the end the host build still failed because its cmake is too old.
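The error message asks for at least CMake 3.14, while the host has 3.5.1. A check like this could be run before configuring; it is only a sketch, the name version_ge is made up, and it assumes sort -V from GNU coreutils is available:

```shell
# Succeed when version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper; it relies on GNU `sort -V`.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage (not run here):
#   have=$(cmake --version | head -n 1 | awk '{print $3}')
#   version_ge "$have" 3.14 || echo "cmake $have is too old"
```

A plain string comparison would get this wrong, because "3.5.1" sorts after "3.14" alphabetically; version sort compares the numeric components instead.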
I downloaded a newer cmake. cmake now works, but the old gcc on the host fails to compile the code.
kylin@kylin-aaa:/shujv/par/llama.cpp$ /shujv/par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake -B .
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
...
-- Adding CPU backend variant ggml-cpu: -mcpu=native
-- ggml version: 0.9.7
-- ggml commit: unknown
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found OpenSSL: /usr/lib/aarch64-linux-gnu/libcrypto.so (found version "1.0.2g")
-- Performing Test OPENSSL_VERSION_SUPPORTED
-- Performing Test OPENSSL_VERSION_SUPPORTED - Failed
-- Generating embedded license file for target: common
-- Configuring done (2.9s)
-- Generating done (1.0s)
-- Build files have been written to: /shujv/par/llama.cpp
kylin@kylin-aaa:/shujv/par/llama.cpp$
kylin@kylin-aaa:/shujv/par/llama.cpp$ cmake --build build --config Release
Error: could not load cache
kylin@kylin-aaa:/shujv/par/llama.cpp$ cmake --build . --config Release
[ 0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[ 0%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
/shujv/par/llama.cpp/ggml/src/ggml.cpp:16:13: warning: 'ggml_uncaught_exception_init' defined but not used [-Wunused-variable]
static bool ggml_uncaught_exception_init = []{
^
[ 1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[ 1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
[ 1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-opt.cpp.o
[ 1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-threading.cpp.o
[ 1%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-quants.c.o
[ 2%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o
/shujv/par/llama.cpp/ggml/src/gguf.cpp: In member function 'const T& gguf_kv::get_val(size_t) const':
/shujv/par/llama.cpp/ggml/src/gguf.cpp:195:12: error: expected '(' before 'constexpr'
if constexpr (std::is_same<T, std::string>::value) {
^
...
/shujv/par/llama.cpp/ggml/src/gguf.cpp: In member function 'bool gguf_reader::read(std::vector<T>&, size_t) const [with T = double; size_t = long unsigned int]':
/shujv/par/llama.cpp/ggml/src/gguf.cpp:304:5: warning: control reaches end of non-void function [-Wreturn-type]
}
^
ggml/src/CMakeFiles/ggml-base.dir/build.make:176: recipe for target 'ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o' failed
make[2]: *** [ggml/src/CMakeFiles/ggml-base.dir/gguf.cpp.o] Error 1
CMakeFiles/Makefile2:2272: recipe for target 'ggml/src/CMakeFiles/ggml-base.dir/all' failed
make[1]: *** [ggml/src/CMakeFiles/ggml-base.dir/all] Error 2
Makefile:145: recipe for target 'all' failed
make: *** [all] Error 2
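The first hard error is on an "if constexpr" statement, a C++17 feature that GCC only accepts from the 7.x series onward, so the host's GCC 5.4.0 cannot compile this file. A rough pre-check, with the hypothetical helper name gcc_supports_if_constexpr:

```shell
# Succeed when a GCC version string has a major version of at least 7,
# the first GCC release series that accepts `if constexpr`.
# gcc_supports_if_constexpr is a hypothetical helper name.
gcc_supports_if_constexpr() {
    major=${1%%.*}          # keep everything before the first dot
    [ "$major" -ge 7 ]
}

# Usage (not run here):
#   gcc_supports_if_constexpr "$(gcc -dumpversion)" || echo "gcc too old for C++17"
```

This only looks at the version number; it does not verify that the rest of the C++17 standard library the project needs is present.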
Back in the gcc container, re-running cmake eventually worked.
root@66d4e20ec1d7:/par/testcc# cd /par/llama.cpp
root@66d4e20ec1d7:/par/llama.cpp# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake -B .
CMake Error: The current CMakeCache.txt directory /par/llama.cpp/CMakeCache.txt is different than the directory /shujv/par/llama.cpp where CMakeCache.txt was created. This may result in binaries being created in the wrong place. If you are not sure, reedit the CMakeCache.txt
CMake Error: The source "/par/llama.cpp/CMakeLists.txt" does not match the source "/shujv/par/llama.cpp/CMakeLists.txt" used to generate cache. Re-run cmake with a different source directory.
root@66d4e20ec1d7:/par/llama.cpp# cd build
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake -B .
CMake Error: The source directory "/par/llama.cpp/build" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake ..
CMake Error: The current CMakeCache.txt directory /par/llama.cpp/CMakeCache.txt is different than the directory /shujv/par/llama.cpp where CMakeCache.txt was created. This may result in binaries being created in the wrong place. If you are not sure, reedit the CMakeCache.txt
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake ..
-- The C compiler identification is GNU 14.2.0
-- The CXX compiler identification is GNU 14.2.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
...
-- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_MATMUL_INT8;-U__ARM_FEATURE_SVE;-U__ARM_FEATURE_SME;-mcpu=tsv110+crc+nordma+nofp16+dotprod+noi8mm+nosve+nosme
-- ggml version: 0.9.7
-- ggml commit: d417bc43d-dirty
-- Found OpenSSL: /usr/lib/aarch64-linux-gnu/libcrypto.so (found version "3.0.15")
-- Performing Test OPENSSL_VERSION_SUPPORTED
-- Performing Test OPENSSL_VERSION_SUPPORTED - Success
-- OpenSSL found: 3.0.15
-- Generating embedded license file for target: common
-- Configuring done (4.5s)
-- Generating done (0.4s)
-- Build files have been written to: /par/llama.cpp/build
root@66d4e20ec1d7:/par/llama.cpp/build#
root@66d4e20ec1d7:/par/llama.cpp/build# ls
CMakeCache.txt CTestTestfile.cmake Makefile bin common examples license.cpp llama-version.cmake pocs tests vendor
CMakeFiles DartConfiguration.tcl Testing cmake_install.cmake compile_commands.json ggml llama-config.cmake llama.pc src tools
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake --build build --config Release
Error: /par/llama.cpp/build/build is not a directory
root@66d4e20ec1d7:/par/llama.cpp/build# /par/cmake/cmake-4.2.1-linux-aarch64/bin/cmake --build . --config Release
[ 0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml.c.o
[ 0%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml.cpp.o
[ 0%] Building C object ggml/src/CMakeFiles/ggml-base.dir/ggml-alloc.c.o
[ 1%] Building CXX object ggml/src/CMakeFiles/ggml-base.dir/ggml-backend.cpp.o
...
[ 99%] Linking CXX executable ../../bin/llama-fit-params
[ 99%] Built target llama-fit-params
[ 99%] Building CXX object tools/results/CMakeFiles/llama-results.dir/results.cpp.o
[100%] Linking CXX executable ../../bin/llama-results
[100%] Built target llama-results
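The CMakeCache.txt errors earlier in this transcript come from the in-source configure done on the host ("cmake -B ."), which left a cache in the source directory recording the host path /shujv/par/llama.cpp. The transcript does not show what changed between the failing and the succeeding "cmake .." run. One plausible fix, sketched here as an assumption, is to delete the stale cache before configuring into a separate build directory. The helper name clean_cmake_cache is made up:

```shell
# Remove CMake's cache files from a directory so it can be reconfigured.
# clean_cmake_cache is a hypothetical helper; it leaves sources untouched.
clean_cmake_cache() {
    dir=$1
    rm -f  "$dir/CMakeCache.txt"
    rm -rf "$dir/CMakeFiles"
}

# Usage (paths taken from the transcript, not run here):
#   clean_cmake_cache /par/llama.cpp
#   cmake -S /par/llama.cpp -B /par/llama.cpp/build
#   cmake --build /par/llama.cpp/build --config Release
```

Passing the build directory explicitly to both steps also avoids the "build/build is not a directory" mistake seen above, since the commands no longer depend on the current directory.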
The compiled llama-cli loads the model downloaded earlier and accepts chat input.
root@66d4e20ec1d7:/par/llama.cpp/build/bin# ./llama-cli -m /par/models/Qwen3.5-0.8B-Q4_K_M.gguf --ctx-size 16384 -cnv
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b8245-d417bc43d
model : Qwen3.5-0.8B-Q4_K_M.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
>
Next, start a server with llama-server.
root@66d4e20ec1d7:/par/llama.cpp/build/bin# ./llama-server -m /par/models/Qwen3.5-0.8B-Q4_K_M.gguf --ctx-size 16384 --port 8080
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8245 (d417bc43d) with GNU 14.2.0 for Linux aarch64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 8
system_info: n_threads = 8 (n_threads_batch = 8) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 7 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/par/models/Qwen3.5-0.8B-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: no devices with dedicated memory found
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.91 seconds
llama_model_loader: loaded meta data with 46 key-value pairs and 320 tensors from /par/models/Qwen3.5-0.8B-Q4_K_M.gguf (version GGUF V3 (latest))
...
llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 3.79 MiB
llama_kv_cache: CPU KV buffer size = 192.00 MiB
llama_kv_cache: size = 192.00 MiB ( 16384 cells, 6 layers, 4/1 seqs), K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_memory_recurrent: CPU RS buffer size = 77.06 MiB
llama_memory_recurrent: size = 77.06 MiB ( 4 cells, 24 layers, 4 seqs), R (f32): 5.06 MiB, S (f32): 72.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: CPU compute buffer size = 489.00 MiB
sched_reserve: graph nodes = 3123 (with bs=512), 1377 (with bs=1)
sched_reserve: graph splits = 1
sched_reserve: reserve took 11.13 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv load_model: initializing slots, n_slots = 4
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 16384
slot load_model: id 1 | task -1 | new slot, n_ctx = 16384
slot load_model: id 2 | task -1 | new slot, n_ctx = 16384
slot load_model: id 3 | task -1 | new slot, n_ctx = 16384
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
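curl can connect to the server to show the model information and to ask a question, although the answer misses the point of the question.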
root@66d4e20ec1d7:/par# curl http://localhost:8080/v1/models
{"models":[{"name":"Qwen3.5-0.8B-Q4_K_M.gguf","model":"Qwen3.5-0.8B-Q4_K_M.gguf","modified_at":"","size":"","digest":"","type":"model","description":"","tags":[""],"capabilities":["completion"],"parameters":"","details":{"parent_model":"","format":"gguf","family":"","families":[""],"parameter_size":"","quantization_level":""}}],"object":"list","data":[{"id":"Qwen3.5-0.8B-Q4_K_M.gguf","aliases":[],"tags":[],"object":"model","created":1773021697,"owned_by":"llamacpp","meta":{"vocab_type":2,"n_vocab":248320,"n_ctx_train":262144,"n_embd":1024,"n_params":752393024,"siz
root@66d4e20ec1d7:/par# curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen3.5-0.8B-Q4_K_M","prompt": "汉译英:人工智能的未来发展","max_tokens": 100, "temperature": 0.8}'
{"choices":[{"text":"\n\n<think>\n\n</think>\n\n人工智能的未来发展\n\n**1. 智能化与自主性**\n随着机器学习算法的不断迭代,人工智能在**自主性**上实现质的飞跃。从传统的被动执行指令,发展到能够**自主规划、自主决策**,甚至能够自主发现并优化自身的效率。这种能力的提升,标志着人工智能从"模仿人类"转向"模拟人类",是迈向深度智能的重要里程碑。\n\n**2. 复杂性与环境适应性**\n面对日益复杂的**自然","index":0,"logprobs":null,"finish_reason":"length"}],"created":1773022021,"model":"Qwen3.5-0.8B-Q4_K_M.gguf","system_fingerprint":"b8245-d417bc43d","object":"text_completion","usage":{"completion_tokens":100,"prompt_tokens":7,"total_tokens":107},"id":"chatcmpl-478Re5iVd6NvrWPvgRpTgm5pPVrNSN5U","timings":{"cache_n":0,"prompt_n":7,"prompt_ms":218.4,"prompt_per_token_ms":31.2,"prompt_per_second":32.05128205128205,"predicted_n":100,"predicted_ms":7099.046,"predicted_per_token_ms":70.99046,"predicted_per_second":14.08639977822372}}
root@66d4e20ec1d7:/par#
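One possible reason for the off-topic answer is that /v1/completions sends the prompt as raw text, without the chat template the server printed at startup. llama-server also exposes an OpenAI-style /v1/chat/completions endpoint that applies the template. The sketch below only builds the request body; the helper name chat_payload is made up, and it assumes the prompt contains no double quotes, backslashes or newlines, since no JSON escaping is done:

```shell
# Build a JSON body for the OpenAI-style chat endpoint.
# chat_payload is a hypothetical helper. Assumption: $1 needs no JSON
# escaping (no double quotes, backslashes or newlines); $2 is an integer.
chat_payload() {
    printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%s}' "$1" "$2"
}

# Usage (needs the llama-server started above, not run here):
#   curl http://localhost:8080/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d "$(chat_payload 'Translate into English: 人工智能的未来发展' 100)"
```

Whether a 0.8B model then answers the question well is a separate matter; this only changes how the prompt is presented to it.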