llama.cpp is a high-performance inference framework implemented in C/C++ that can run quantized models efficiently on ordinary PCs and embedded systems. This article walks through compiling llama.cpp so it can run on an ordinary computer.
Environment
```bash
~$ uname -a
5.15.0-139-generic #149~20.04.1-Ubuntu SMP Wed Apr 16 08:29:56 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
~$ nvidia-smi
Sun Apr 5 20:06:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T600 Laptop GPU Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 N/A / 5001W | 3595MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1782 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3149 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 268153 C ./build/bin/llama-server 3582MiB |
+-----------------------------------------------------------------------------------------+
```
Prerequisites
CMake
The CMake that ships with Ubuntu 20.04 is too old; version 3.18 or later is required.
Download a release from GitHub and install it manually:
https://github.com/Kitware/CMake/releases
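Whether a manual install is actually needed can be checked by comparing the installed version against 3.18. A small sketch using `sort -V` for version-aware comparison (the literal version numbers below are just illustrations; Ubuntu 20.04's stock CMake is 3.16.3):

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B.
# sort -V sorts version strings numerically; if B comes out
# first (i.e. B is the smaller of the pair), A meets the requirement.
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Ubuntu 20.04's stock cmake fails the 3.18 requirement:
version_ge 3.16.3 3.18 && echo "3.16.3 is new enough" || echo "3.16.3 is too old"
# a manually installed release passes:
version_ge 3.28.3 3.18 && echo "3.28.3 is new enough" || echo "3.28.3 is too old"
```

In practice you would feed the helper the real installed version, e.g. `version_ge "$(cmake --version | head -n1 | awk '{print $3}')" 3.18`.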
nvcc
The nvcc in Ubuntu 20.04's default repositories is also too old; a newer version can be installed as follows:
```bash
sudo apt remove nvidia-cuda-toolkit
sudo apt autoremove
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-8
```
Configuration
```bash
# append the following to ~/.bashrc
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
# then reload the shell configuration
source ~/.bashrc
```
Verification
```bash
nvcc --version   # should report 12.8
which nvcc       # should point to /usr/local/cuda-12.8/bin/nvcc
```
Building llama.cpp
Download
https://codeload.github.com/ggml-org/llama.cpp/tar.gz/refs/tags/b8642
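Fetching and unpacking the source from the codeload link above might look like this (codeload names the top-level directory `llama.cpp-b8642`, matching the shell prompt shown later):

```shell
# download the b8642 source tarball and unpack it
wget -O llama.cpp-b8642.tar.gz \
  https://codeload.github.com/ggml-org/llama.cpp/tar.gz/refs/tags/b8642
tar -xzf llama.cpp-b8642.tar.gz
cd llama.cpp-b8642
```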
Build
```bash
export CUDACXX=/usr/local/cuda-12.8/bin/nvcc
# GGML_CUDA replaced the older LLAMA_CUDA option in recent llama.cpp releases
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DBUILD_SHARED_LIBS=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc
cmake --build build --config Release -j$(nproc)
```
Build artifacts
```bash
llama.cpp-b8642$ ls build/bin/
export-graph-ops llama-embedding llama-imatrix llama-passkey llama-speculative test-barrier test-json-partial test-quantize-stats
llama-batched llama-eval-callback llama-llava-cli llama-perplexity llama-speculative-simple test-c test-json-schema-to-grammar test-reasoning-budget
llama-batched-bench llama-export-lora llama-lookahead llama-q8dot llama-template-analysis test-chat test-llama-archs test-regex-partial
llama-bench llama-finetune llama-lookup llama-quantize llama-tokenize test-chat-auto-parser test-llama-grammar test-rope
llama-cli llama-fit-params llama-lookup-create llama-qwen2vl-cli llama-tts test-chat-peg-parser test-log test-sampling
llama-completion llama-gemma3-cli llama-lookup-merge llama-results llama-vdot test-chat-template test-model-load-cancel test-state-restore-fragmented
llama-convert-llama2c-to-ggml llama-gen-docs llama-lookup-stats llama-retrieval test-alloc test-gbnf-validator test-mtmd-c-api test-thread-safety
llama-cvector-generator llama-gguf llama-minicpmv-cli llama-save-load-state test-arg-parser test-gguf test-opt test-tokenizer-0
llama-debug llama-gguf-hash llama-mtmd-cli llama-server test-autorelease test-grammar-integration test-peg-parser test-tokenizer-1-bpe
llama-debug-template-parser llama-gguf-split llama-mtmd-debug llama-simple test-backend-ops test-grammar-parser test-quantize-fns test-tokenizer-1-spm
llama-diffusion-cli llama-idle llama-parallel llama-simple-chat test-backend-sampler test-jinja test-quantize-perf
```
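Once built, the binaries can be smoke-tested directly. The model path below is a placeholder for any quantized GGUF file; `-ngl 99` offloads all layers to the GPU, which fits within the T600's 4 GB of VRAM for small quantized models (the nvidia-smi output earlier shows llama-server using about 3.5 GB):

```shell
# one-shot generation with llama-cli (model path is a placeholder)
./build/bin/llama-cli -m /path/to/model.gguf -p "Hello" -n 32 -ngl 99

# or expose an OpenAI-compatible HTTP API with llama-server
./build/bin/llama-server -m /path/to/model.gguf --port 8080 -ngl 99
```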