Building llama.cpp

llama.cpp is a high-performance inference framework written in C/C++ that runs quantized models efficiently on ordinary PCs and embedded systems. This article describes how to compile llama.cpp so it can run on an ordinary PC.

Environment

```shell
~$ uname -a
 5.15.0-139-generic #149~20.04.1-Ubuntu SMP Wed Apr 16 08:29:56 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
~$ nvidia-smi
Sun Apr  5 20:06:46 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T600 Laptop GPU         Off |   00000000:01:00.0 Off |                  N/A |
| N/A   45C    P8            N/A  / 5001W |    3595MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1782      G   /usr/lib/xorg/Xorg                        4MiB |
|    0   N/A  N/A            3149      G   /usr/lib/xorg/Xorg                        4MiB |
|    0   N/A  N/A          268153      C   ./build/bin/llama-server               3582MiB |
+-----------------------------------------------------------------------------------------+
```

Prerequisites

CMake

The CMake shipped with Ubuntu 20.04 is too old; version 3.18 or newer is required.

Download a release from GitHub and install it manually:

https://github.com/Kitware/CMake/releases
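For example, the prebuilt Linux binaries can be unpacked anywhere and put on the PATH. A minimal sketch, using 3.28.3 as a placeholder version (pick the actual latest release from the page above):

```shell
# The version number is an example -- substitute the release you downloaded.
CMAKE_VERSION=3.28.3
CMAKE_PKG=cmake-${CMAKE_VERSION}-linux-x86_64
wget https://github.com/Kitware/CMake/releases/download/v${CMAKE_VERSION}/${CMAKE_PKG}.tar.gz
tar -xzf ${CMAKE_PKG}.tar.gz
# Move the unpacked tree to a system location and put its bin/ on the PATH
sudo mv ${CMAKE_PKG} /opt/cmake
export PATH=/opt/cmake/bin:$PATH
cmake --version
```

Installing under /opt avoids conflicts with the distro package; alternatively, `apt` via the Kitware APT repository also works on Ubuntu 20.04.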

nvcc

The nvcc shipped with Ubuntu 20.04 is also too old. A newer version can be installed as follows:

```shell
# Remove the outdated distro CUDA toolkit first
sudo apt remove nvidia-cuda-toolkit
sudo apt autoremove
# Add NVIDIA's official apt repository and install CUDA 12.8
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-8
```

Configuration

```shell
# Add the following two lines to ~/.bashrc
export PATH=/usr/local/cuda-12.8/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH
# Then reload the shell configuration
source ~/.bashrc
```

Verification

```shell
nvcc --version    # should report release 12.8
which nvcc        # should point to /usr/local/cuda-12.8/bin/nvcc
```

Compiling llama.cpp

Download

https://codeload.github.com/ggml-org/llama.cpp/tar.gz/refs/tags/b8642
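For example, the tagged source can be fetched and unpacked from the command line:

```shell
# Fetch and unpack the b8642 source tag
TAG=b8642
wget -O llama.cpp-${TAG}.tar.gz \
    https://codeload.github.com/ggml-org/llama.cpp/tar.gz/refs/tags/${TAG}
tar -xzf llama.cpp-${TAG}.tar.gz
cd llama.cpp-${TAG}
```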

Build

```shell
export CUDACXX=/usr/local/cuda-12.8/bin/nvcc
cmake -B build -DLLAMA_CUDA=1 -DLLAMA_CURL=1 -DBUILD_SHARED_LIBS=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc
# Note: recent llama.cpp versions accept -DGGML_CUDA=ON in place of the deprecated -DLLAMA_CUDA=1
cmake --build build --config Release   # add -j $(nproc) to build in parallel
```

Build artifacts

```shell
llama.cpp-b8642$ ls build/bin/
export-graph-ops               llama-embedding      llama-imatrix        llama-passkey          llama-speculative         test-barrier              test-json-partial            test-quantize-stats
llama-batched                  llama-eval-callback  llama-llava-cli      llama-perplexity       llama-speculative-simple  test-c                    test-json-schema-to-grammar  test-reasoning-budget
llama-batched-bench            llama-export-lora    llama-lookahead      llama-q8dot            llama-template-analysis   test-chat                 test-llama-archs             test-regex-partial
llama-bench                    llama-finetune       llama-lookup         llama-quantize         llama-tokenize            test-chat-auto-parser     test-llama-grammar           test-rope
llama-cli                      llama-fit-params     llama-lookup-create  llama-qwen2vl-cli      llama-tts                 test-chat-peg-parser      test-log                     test-sampling
llama-completion               llama-gemma3-cli     llama-lookup-merge   llama-results          llama-vdot                test-chat-template        test-model-load-cancel       test-state-restore-fragmented
llama-convert-llama2c-to-ggml  llama-gen-docs       llama-lookup-stats   llama-retrieval        test-alloc                test-gbnf-validator       test-mtmd-c-api              test-thread-safety
llama-cvector-generator        llama-gguf           llama-minicpmv-cli   llama-save-load-state  test-arg-parser           test-gguf                 test-opt                     test-tokenizer-0
llama-debug                    llama-gguf-hash      llama-mtmd-cli       llama-server           test-autorelease          test-grammar-integration  test-peg-parser              test-tokenizer-1-bpe
llama-debug-template-parser    llama-gguf-split     llama-mtmd-debug     llama-simple           test-backend-ops          test-grammar-parser       test-quantize-fns            test-tokenizer-1-spm
llama-diffusion-cli            llama-idle           llama-parallel       llama-simple-chat      test-backend-sampler      test-jinja                test-quantize-perf
```
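With the build complete, the binaries under build/bin/ can be run directly. A minimal sketch of starting llama-server with a local model (the GGUF path and port below are placeholders, not part of the original article):

```shell
# Serve a local GGUF model over an HTTP API.
# Replace the model path with a real quantized model file.
MODEL=/path/to/model.gguf
./build/bin/llama-server \
    -m "$MODEL" \
    --port 8080 \
    -ngl 99    # offload as many layers as fit onto the GPU
```

With `-ngl 99` the server offloads model layers to the GPU, which is what produces the llama-server process visible in the nvidia-smi output above.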