llama.cpp

文章目录


一、关于 llama.cpp


llama.cpp的主要目标是 使LLM推理 具有最少的设置和最先进的性能,在各种硬件--本地和云端。

  • 没有任何依赖关系的普通C/C++实现
  • Apple silicon 是一等公民 -- 通过ARM NEON、Accelerate和Metal框架进行优化
  • AVX、AVX2和AVX512支持x86架构
  • 1.5位、2位、3位、4位、5位、6位和8位整数量化,用于更快的推理和减少内存使用
  • 用于在NVIDIA GPU上运行LLM的自定义CUDA内核(通过HIP支持AMD GPU)
  • Vulkan和SYCL后端支持
  • CPU+GPU 混合推断 部分加速模型大于总VRAM容量

启动以来,由于许多 contributions,该项目有了显著改善。

它是为ggml库 开发新功能的主要场所。


支持的模型:

通常也支持下面基本模型的细调。

(instructions for supporting more models: HOWTO-add-model.md)


Multimodal models:


Bindings:


UI:

除非另有说明,否则这些项目是具有许可的开源项目:

(要在此处列出一个项目,它应该明确说明它依赖于 llama.cpp)


Tools:

  • akx/ggify -- 从 HuggingFace Hub 下载 PyTorch 模型,然后转化他们到 GGML
  • crashr/gppm -- 使用 NVIDIA Tesla P40 或 P100 GPU 加载 llama.cpp实例,降低空闲功耗

Infrastructure:

  • Paddler - 为llama.cpp定制的状态负载均衡器

二、Demo


1、Typical run using LLaMA v2 13B on M2 Ultra

python 复制代码
$ make -j && ./llama-cli -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.            -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./common -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: build = 1041 (cf658ad)
main: seed  = 1692823051
llama_model_loader: loaded meta data with 16 key-value pairs and 363 tensors from models/llama-13b-v2/ggml-model-q4_0.gguf (version GGUF V1 (latest))
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_print_meta: format         = GGUF V1 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 512
llm_load_print_meta: n_embd         = 5120
llm_load_print_meta: n_head         = 40
llm_load_print_meta: n_head_kv      = 40
llm_load_print_meta: n_layer        = 40
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 1
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 13824
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 13B
llm_load_print_meta: model ftype    = mostly Q4_0
llm_load_print_meta: model size     = 13.02 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: mem required  = 7024.01 MB (+  400.00 MB per state)
...................................................................................................
llama_new_context_with_model: kv self size  =  400.00 MB
llama_new_context_with_model: compute buffer total size =   75.41 MB

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1: Find the right website platform.
Step 2: Choose your domain name and hosting plan.
Step 3: Design your website layout.
Step 4: Write your website content and add images.
Step 5: Install security features to protect your site from hackers or spammers
Step 6: Test your website on multiple browsers, mobile devices, operating systems etc...
Step 7: Test it again with people who are not related to you personally -- friends or family members will work just fine!
Step 8: Start marketing and promoting the website via social media channels or paid ads
Step 9: Analyze how many visitors have come to your site so far, what type of people visit more often than others (e.g., men vs women) etc...
Step 10: Continue to improve upon all aspects mentioned above by following trends in web design and staying up-to-date on new technologies that can enhance user experience even further!
How does a Website Work?
A website works by having pages, which are made of HTML code. This code tells your computer how to display the content on each page you visit -- whether it's an image or text file (like PDFs). In order for someone else's browser not only be able but also want those same results when accessing any given URL; some additional steps need taken by way of programming scripts that will add functionality such as making links clickable!
The most common type is called static HTML pages because they remain unchanged over time unless modified manually (either through editing files directly or using an interface such as WordPress). They are usually served up via HTTP protocols -- this means anyone can access them without having any special privileges like being part of a group who is allowed into restricted areas online; however, there may still exist some limitations depending upon where one lives geographically speaking.
How to
llama_print_timings:        load time =   576.45 ms
llama_print_timings:      sample time =   283.10 ms /   400 runs   (    0.71 ms per token,  1412.91 tokens per second)
llama_print_timings: prompt eval time =   599.83 ms /    19 tokens (   31.57 ms per token,    31.68 tokens per second)
llama_print_timings:        eval time = 24513.59 ms /   399 runs   (   61.44 ms per token,    16.28 tokens per second)
llama_print_timings:       total time = 25431.49 ms


2、Demo of running both LLaMA-7B and whisper.cpp on a single M1 Pro MacBook

这是运行LLaMA-7B和 whisper.cpp 的另一个演示,在一台M1 Pro MacBook上:

https://private-user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8b4f-add84093ffff.mp4


三、用法

以下是大多数受支持模型的 端到端二进制构建 和 模型转换步骤。


1、基本用法

首先,您需要获取二进制文件。您可以遵循不同的方法:

  • 方法一:克隆此仓库并本地构建,看如何构建
  • 方法二:如果你使用的是MacOS或Linux,你可以通过brew、flox或nix 安装 llama. cpp
  • 方法3:使用Docker镜像,见为Docker留档
  • 方法4:从 releases 下载预构建二进制

您可以使用以下命令运行基本完成:

shell 复制代码
llama-cli -m your_model.gguf -p "I believe the meaning of life is" -n 128

# Output:
# I believe the meaning of life is to find your own truth and to live in accordance with it. For me, this means being true to myself and following my passions, even if they don't align with societal expectations. I think that's what I love about yoga -- it's not just a physical practice, but a spiritual one too. It's about connecting with yourself, listening to your inner voice, and honoring your own unique journey.

有关参数的完整列表,请参阅此页面


2、对话模式

如果您想要更ChatGPT的体验,可以通过将-cnv作为参数传递来 在对话模式下运行:

shell 复制代码
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv

# Output:
# > hi, who are you?
# Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
#
# > what is 1+1?
# Easy peasy! The answer to 1+1 is... 2!

默认情况下,聊天模板将取自输入模型。如果要使用另一个聊天模板,传递--chat-template NAME参数。查看支持模板

shell 复制代码
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --chat-template chatml

您还可以通过前缀、后缀 和 反向提示参数 使用自己的模板:

shell 复制代码
./llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv --in-prefix 'User: ' --reverse-prompt 'User:'

3、网络服务

llama. cpp Web服务是一个轻量级的OpenAI API兼容HTTP服务器,可用于服务本地模型 并轻松将它们连接到现有客户端。

示例用法:

shell 复制代码
./llama-server -m your_model.gguf --port 8080

# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions

4、交互模式

注:如果您更喜欢基本用法,请考虑使用对话模式而不是交互模式

在这种模式下,您始终可以通过按Ctrl+C,并输入一行或多行文本来中断生成,这些文本将被转换为 tokens 并附加到当前上下文中。

您还可以使用参数-r "reverse prompt string"指定反向提示。这将导致每当在生成中遇到反向提示字符串的确切 tokens 时都会提示用户输入。

一个典型的用途是使用一个提示,使LLaMA模拟多个用户之间的聊天,比如说Alice和Bob,并传递-r "Alice:"

这是一个使用命令调用的 few-shot 交互示例

shell 复制代码
# default arguments using a 7B model
./examples/chat.sh

# advanced chat with a 13B model
./examples/chat-13B.sh

# custom arguments using a 13B model
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --repeat_penalty 1.0 --color -i -r "User:" -f prompts/chat-with-bob.txt

注意使用--color来区分用户输入和生成的文本。

llama-cli示例程序的README中更详细地解释了其他参数。



5、持久互动

提示符、用户输入和模型代可以通过调用./llama-cli来保存和恢复--prompt-cache--prompt-cache-all./examples/chat-persistent.sh脚本演示了这一点,支持长时间运行的、可恢复的聊天会话。

要使用此示例,您必须提供一个文件来缓存初始聊天提示符和一个目录来保存聊天会话,并且可以选择提供与chat-13B.sh相同的提示缓存可以重复用于新的聊天会话。

请注意,提示缓存和聊天目录都绑定到初始提示符(PROMPT_TEMPLATE)和模型文件。

shell 复制代码
# Start a new chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Resume that chat
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/default ./examples/chat-persistent.sh

# Start a different chat with the same prompt/model
PROMPT_CACHE_FILE=chat.prompt.bin CHAT_SAVE_DIR=./chat/another ./examples/chat-persistent.sh

# Different prompt cache for different prompt/model
PROMPT_TEMPLATE=./prompts/chat-with-bob.txt PROMPT_CACHE_FILE=bob.prompt.bin \
    CHAT_SAVE_DIR=./chat/bob ./examples/chat-persistent.sh

6、语法约束输出

llama.cpp支持约束模型输出的语法。例如,您可以强制模型仅输出JSON:

shell 复制代码
./llama-cli -m ./models/13B/ggml-model-q4_0.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'

grammars/文件夹包含一些示例语法。要编写自己的语法,请查看GBNF指南

要编写更复杂的JSON语法,您还可以查看https://grammar.intrinsiclabs.ai/,这是一个浏览器应用程序,可让您编写TypeScript接口,并将其编译为GBNF语法,您可以保存以供本地使用。

请注意,该应用程序是由社区成员构建和维护的,请在其存储库上提交任何问题或FR,而不是这个。


四、构建

请参考本地Build llama. cpp


五、支持的后端

Backend Target devices
Metal Apple Silicon
BLAS All
BLIS All
SYCL Intel and Nvidia GPU
CUDA Nvidia GPU
hipBLAS AMD GPU
Vulkan GPU

六、工具


1、准备和量化

注:你可以使用Hugging Face 上的 GGUF-my-repo空间 量化你的模型权重,不需要任何设置。它每6小时从llama.cpp同步一次。

要获得官方的LLaMA 2权重,请参阅 Obtaining and using the Facebook LLaMA 2 model 部分。

Hugging Face 上还有大量预量化gguf模型可供选择。

注意:convert.py已移至examples/convert_legacy_llama.py,不应用于Llama/Llama2/Mistral模型及其衍生产品以外的任何内容。 它不支持 LLaMA 3,您可以使用convert_hf_to_gguf.py从 Hugging Face 下载 LLaMA 3。

要了解有关量化模型的更多信息,请阅读此留档


2、困惑(测量模型质量)

您可以使用perplexity示例 来测量给定提示的困惑(困惑程度越低越好)。

有关详细信息,请参见https://huggingface.co/docs/transformers/perplexity

要了解更多如何使用llama. cpp测量困惑,请阅读此留档


七、其他文件


开发文件


关于模型的开创性论文和背景

如果您的问题是模型生成质量,那么请至少扫描以下链接和论文以了解LLaMA模型的局限性。在选择适当的模型尺寸并欣赏LLaMA模型和ChatGPT之间的显着和细微差异时,这一点尤其重要:


2024-07

相关推荐
Jerry Lau1 小时前
大模型-本地化部署调用--基于ollama+openWebUI+springBoot
java·spring boot·后端·llama
斯多葛的信徒3 小时前
看看你的电脑可以跑 AI 模型吗?
人工智能·语言模型·电脑·llama
网安打工仔3 小时前
斯坦福李飞飞最新巨著《AI Agent综述》
人工智能·自然语言处理·大模型·llm·agent·ai大模型·大模型入门
AGI学习社3 小时前
2024中国排名前十AI大模型进展、应用案例与发展趋势
linux·服务器·人工智能·华为·llama
猿类崛起@3 小时前
百度千帆大模型实战:AI大模型开发的调用指南
人工智能·学习·百度·大模型·产品经理·大模型学习·大模型教程
黑客-雨3 小时前
从零开始:如何用Python训练一个AI模型(超详细教程)非常详细收藏我这一篇就够了!
开发语言·人工智能·python·大模型·ai产品经理·大模型学习·大模型入门
周杰伦_Jay20 小时前
Ollama能本地部署Llama 3等大模型的原因解析(ollama核心架构、技术特性、实际应用)
数据结构·人工智能·深度学习·架构·transformer·llama
Ai多利1 天前
2025发文新方向:AI+量化 人工智能与金融完美融合!
人工智能·ai·金融·量化
Yongqiang Cheng2 天前
llama.cpp LLM_ARCH_NAMES
llama.cpp·arch_names
Allen200003 天前
wow-agent---task2使用llama-index创建Agent
人工智能·llama