文章目录
-
- [一、关于 llama.cpp](#一、关于 llama.cpp)
- 二、Demo
-
- [1、Typical run using LLaMA v2 13B on M2 Ultra](#1、Typical run using LLaMA v2 13B on M2 Ultra)
- [2、Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook](#2、Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook)
- 三、用法
- 四、构建
- 五、支持的后端
- 六、工具
- 七、其他文件
一、关于 llama.cpp
- github : https://github.com/ggerganov/llama.cpp
- Roadmap / Project status / Manifesto / ggml
llama.cpp
的主要目标是 使LLM推理 具有最少的设置和最先进的性能,在各种硬件--本地和云端。
- 没有任何依赖关系的普通C/C++实现
- Apple silicon 是一等公民 -- 通过ARM NEON、Accelerate和Metal框架进行优化
- AVX、AVX2和AVX512支持x86架构
- 1.5位、2位、3位、4位、5位、6位和8位整数量化,用于更快的推理和减少内存使用
- 用于在NVIDIA GPU上运行LLM的自定义CUDA内核(通过HIP支持AMD GPU)
- Vulkan和SYCL后端支持
- CPU+GPU 混合推断 部分加速模型大于总VRAM容量
自启动以来,由于许多 contributions,该项目有了显著改善。
它是为ggml库 开发新功能的主要场所。
支持的模型:
通常也支持下面基本模型的细调。
- LLaMA 🦙
- LLaMA 2 🦙🦙
- LLaMA 3 🦙🦙🦙
- Mistral 7B
- Mixtral MoE
- DBRX
- Falcon
- Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- Vigogne (French)
- BERT
- Koala
- Baichuan 1 & 2 + derivations
- Aquila 1 & 2
- Starcoder models
- Refact
- MPT
- Bloom
- Yi models
- StableLM models
- Deepseek models
- Qwen models
- PLaMo-13B
- Phi models
- GPT-2
- Orion 14B
- InternLM2
- CodeShell
- Gemma
- Mamba
- Grok-1
- Xverse
- Command-R models
- SEA-LION
- GritLM-7B + GritLM-8x7B
- OLMo
- GPT-NeoX + Pythia
- ChatGLM3-6b + ChatGLM4-9b
(instructions for supporting more models: HOWTO-add-model.md)
Multimodal models:
- LLaVA 1.5 models, LLaVA 1.6 models
- BakLLaVA
- Obsidian
- ShareGPT4V
- MobileVLM 1.7B/3B models
- Yi-VL
- Mini CPM
- Moondream
- Bunny
Bindings:
- Python: abetlen/llama-cpp-python
- Go: go-skynet/go-llama.cpp
- Node.js: withcatai/node-llama-cpp
- JS/TS (llama.cpp server client): lgrammel/modelfusion
- JavaScript/Wasm (works in browser): tangledgroup/llama-cpp-wasm
- Typescript/Wasm (nicer API, available on npm): ngxson/wllama
- Ruby: yoshoku/llama_cpp.rb
- Rust (more features): edgenai/llama_cpp-rs
- Rust (nicer API): mdrokz/rust-llama.cpp
- Rust (more direct bindings): utilityai/llama-cpp-rs
- C#/.NET: SciSharp/LLamaSharp
- Scala 3: donderom/llm4s
- Clojure: phronmophobic/llama.clj
- React Native: mybigday/llama.rn
- Java: kherud/java-llama.cpp
- Zig: deins/llama.cpp.zig
- Flutter/Dart: netdur/llama_cpp_dart
- PHP (API bindings and features built on top of llama.cpp): distantmagic/resonance (more info)
- Guile Scheme: guile_llama_cpp
UI:
除非另有说明,否则这些项目是具有许可的开源项目:
- iohub/collama
- janhq/jan (AGPL)
- nat/openplayground
- Faraday (proprietary)
- LMStudio (proprietary)
- Layla (proprietary)
- LocalAI (MIT)
- LostRuins/koboldcpp (AGPL)
- Mozilla-Ocho/llamafile
- nomic-ai/gpt4all
- ollama/ollama
- oobabooga/text-generation-webui (AGPL)
- psugihara/FreeChat
- cztomsik/ava (MIT)
- ptsochantaris/emeltal
- pythops/tenere (AGPL)
- RAGNA Desktop (proprietary)
- RecurseChat (proprietary)
- semperai/amica
- withcatai/catai
- Mobile-Artificial-Intelligence/maid (MIT)
- Msty (proprietary)
- LLMFarm (MIT)
- KanTV(Apachev2.0 or later)
- Dot (GPL)
- MindMac (proprietary)
- KodiBot (GPL)
- eva (MIT)
- AI Sublime Text plugin (MIT)
- AIKit (MIT)
- LARS - The LLM & Advanced Referencing Solution (AGPL)
(要在此处列出一个项目,它应该明确说明它依赖于 llama.cpp
)
Tools:
- akx/ggify -- 从 HuggingFace Hub 下载 PyTorch 模型,然后转化他们到 GGML
- crashr/gppm -- 使用 NVIDIA Tesla P40 或 P100 GPU 加载 llama.cpp实例,降低空闲功耗
Infrastructure:
- Paddler - 为llama.cpp定制的状态负载均衡器
二、Demo
1、Typical run using LLaMA v2 13B on M2 Ultra
python
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: Nothing to be done for `default'.
main: build = 1041 (cf658ad)
main: seed = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_print_meta: format = GGUF V1 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_ctx = 512
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: freq_base = 10000.0
llm_load_print_meta: freq_scale = 1
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model size = 13.02 B
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.11 MB
llm_load_tensors: mem required = 7024.01 MB (+ 400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.41 MB
system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0
Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc...
Step 7: Test it again with people who are not related to you personally -- friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc...
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit -- whether it's an image or text file (like PDFs). In order for someone else's browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols -- this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings: load time = 576.45 ms
llama_print_timings: sample time = 283.10 ms / 400 runs ( 0.71 ms per token, 1412.91 tokens per second)
llama_print_timings: prompt eval time = 599.83 ms / 19 tokens ( 31.57 ms per token, 31.68 tokens per second)
llama_print_timings: eval time = 24513.59 ms / 399 runs ( 61.44 ms per token, 16.28 tokens per second)
llama_print_timings: total time = 25431.49 ms
2、Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook
这是运行LLaMA-7B和 whisper.cpp 的另一个演示,在一台M1 Pro MacBook上:
三、用法
以下是大多数受支持模型的 端到端二进制构建 和 模型转换步骤。
1、基本用法
首先,您需要获取二进制文件。您可以遵循不同的方法:
- 方法一:克隆此仓库并本地构建,看如何构建
- 方法二:如果你使用的是MacOS或Linux,你可以通过brew、flox或nix 安装 llama. cpp
- 方法3:使用Docker镜像,见为Docker留档
- 方法4:从 releases 下载预构建二进制
您可以使用以下命令运行基本完成:
shell
llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128
# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga -- it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.
有关参数的完整列表,请参阅此页面。
2、对话模式
如果您想要更ChatGPT的体验,可以通过将-cnv
作为参数传递来 在对话模式下运行:
shell
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!
默认情况下,聊天模板将取自输入模型。如果要使用另一个聊天模板,传递--chat-template NAME
参数。查看支持模板
shell
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml
您还可以通过前缀、后缀 和 反向提示参数 使用自己的模板:
shell
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
3、网络服务
llama. cpp Web服务是一个轻量级的OpenAI API兼容HTTP服务器,可用于服务本地模型 并轻松将它们连接到现有客户端。
示例用法:
shell
./llama-server -m your_model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
4、交互模式
注:如果您更喜欢基本用法,请考虑使用对话模式而不是交互模式
在这种模式下,您始终可以通过按Ctrl+C,并输入一行或多行文本来中断生成,这些文本将被转换为 tokens 并附加到当前上下文中。
您还可以使用参数-r "reverse prompt string"
指定反向提示。这将导致每当在生成中遇到反向提示字符串的确切 tokens 时都会提示用户输入。
一个典型的用途是使用一个提示,使LLaMA模拟多个用户之间的聊天,比如说Alice和Bob,并传递-r "Alice:"
。
这是一个使用命令调用的 few-shot 交互示例
shell
# default arguments using a 7B model
./examples/chat.sh
# advanced chat with a 13B model
./examples/chat-13B.sh
# custom arguments using a 13B model
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt
注意使用--color
来区分用户输入和生成的文本。
llama-cli
示例程序的README中更详细地解释了其他参数。
5、持久互动
提示符、用户输入和模型代可以通过调用./llama-cli
来保存和恢复--prompt-cache
和--prompt-cache-all
。./examples/chat-persistent.sh
脚本演示了这一点,支持长时间运行的、可恢复的聊天会话。
要使用此示例,您必须提供一个文件来缓存初始聊天提示符和一个目录来保存聊天会话,并且可以选择提供与chat-13B.sh
相同的提示缓存可以重复用于新的聊天会话。
请注意,提示缓存和聊天目录都绑定到初始提示符(PROMPT_TEMPLATE
)和模型文件。
shell
# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh
# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh
# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh
6、语法约束输出
llama.cpp
支持约束模型输出的语法。例如,您可以强制模型仅输出JSON:
shell
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
grammars/
文件夹包含一些示例语法。要编写自己的语法,请查看GBNF指南。
要编写更复杂的JSON语法,您还可以查看https://grammar.intrinsiclabs.ai/,这是一个浏览器应用程序,可让您编写TypeScript接口,并将其编译为GBNF语法,您可以保存以供本地使用。
请注意,该应用程序是由社区成员构建和维护的,请在其存储库上提交任何问题或FR,而不是这个。
四、构建
五、支持的后端
Backend | Target devices |
---|---|
Metal | Apple Silicon |
BLAS | All |
BLIS | All |
SYCL | Intel and Nvidia GPU |
CUDA | Nvidia GPU |
hipBLAS | AMD GPU |
Vulkan | GPU |
六、工具
1、准备和量化
注:你可以使用Hugging Face 上的 GGUF-my-repo空间 量化你的模型权重,不需要任何设置。它每6小时从llama.cpp
同步一次。
要获得官方的LLaMA 2权重,请参阅 Obtaining and using the Facebook LLaMA 2 model 部分。
Hugging Face 上还有大量预量化gguf
模型可供选择。
注意:convert.py
已移至examples/convert_legacy_llama.py
,不应用于Llama/Llama2/Mistral
模型及其衍生产品以外的任何内容。 它不支持 LLaMA 3,您可以使用convert_hf_to_gguf.py
从 Hugging Face 下载 LLaMA 3。
要了解有关量化模型的更多信息,请阅读此留档
2、困惑(测量模型质量)
您可以使用perplexity
示例 来测量给定提示的困惑(困惑程度越低越好)。
有关详细信息,请参见https://huggingface.co/docs/transformers/perplexity。
要了解更多如何使用llama. cpp测量困惑,请阅读此留档
七、其他文件
开发文件
关于模型的开创性论文和背景
如果您的问题是模型生成质量,那么请至少扫描以下链接和论文以了解LLaMA模型的局限性。在选择适当的模型尺寸并欣赏LLaMA模型和ChatGPT之间的显着和细微差异时,这一点尤其重要:
- LLaMA
- GPT-3
- GPT-3.5 / InstructGPT / ChatGPT:
2024-07