🚀 LightRAG: Simple and Fast Retrieval-Augmented Generation
This repository hosts the code of LightRAG. The structure of this code is based on nano-graphrag.
🎉 News
[2024.10.29]🎯📢 LightRAG now supports multiple file types, including PDF, DOC, PPT, and CSV, via textract.
[2024.10.20]🎯📢 We added a new feature to LightRAG: graph visualization.
[2024.10.18]🎯📢 We added a link to a LightRAG introduction video. Thanks to the author!
[2024.10.17]🎯📢 We created a Discord channel! Welcome to join for sharing and discussion! 🎉🎉
[2024.10.16]🎯📢 LightRAG now supports Ollama models!
[2024.10.15]🎯📢 LightRAG now supports Hugging Face models!
Algorithm Flowchart
Installation
- Install from source (recommended)
bash
cd LightRAG
pip install -e .
- Install from PyPI
bash
pip install lightrag-hku
Quick Start
- A video demo of running LightRAG locally.
- All the code can be found in the examples directory.
- If you are using an OpenAI model, set your OpenAI API key in the environment: export OPENAI_API_KEY="sk-...".
- Download the demo text "A Christmas Carol" by Charles Dickens:
bash
curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt
Use the following Python snippet (in a script) to initialize LightRAG and perform queries:
python
import os
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete, gpt_4o_complete
#########
# Uncomment the below two lines if running in a jupyter notebook to handle the async nature of rag.insert()
# import nest_asyncio
# nest_asyncio.apply()
#########
WORKING_DIR = "./dickens"
if not os.path.exists(WORKING_DIR):
    os.mkdir(WORKING_DIR)

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=gpt_4o_mini_complete  # Use gpt_4o_mini_complete LLM model
    # llm_model_func=gpt_4o_complete  # Optionally, use a stronger model
)

with open("./book.txt") as f:
    rag.insert(f.read())
# Perform naive search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="naive")))
# Perform local search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="local")))
# Perform global search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="global")))
# Perform hybrid search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid")))
Using an OpenAI-like API
- LightRAG also supports OpenAI-like chat/embedding APIs:
python
import os
import numpy as np
from lightrag import LightRAG
from lightrag.llm import openai_complete_if_cache, openai_embedding
from lightrag.utils import EmbeddingFunc

async def llm_model_func(
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    return await openai_complete_if_cache(
        "solar-mini",
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        api_key=os.getenv("UPSTAGE_API_KEY"),
        base_url="https://api.upstage.ai/v1/solar",
        **kwargs
    )

async def embedding_func(texts: list[str]) -> np.ndarray:
    return await openai_embedding(
        texts,
        model="solar-embedding-1-large-query",
        api_key=os.getenv("UPSTAGE_API_KEY"),
        base_url="https://api.upstage.ai/v1/solar"
    )

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=4096,
        max_token_size=8192,
        func=embedding_func
    )
)
Using Hugging Face Models
- If you want to use a Hugging Face model, you only need to set LightRAG up as follows:
python
from lightrag import LightRAG
from lightrag.llm import hf_model_complete, hf_embedding
from lightrag.utils import EmbeddingFunc
from transformers import AutoModel, AutoTokenizer

# Initialize LightRAG with a Hugging Face model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=hf_model_complete,  # Use Hugging Face model for text generation
    llm_model_name='meta-llama/Llama-3.1-8B-Instruct',  # Model name from Hugging Face
    # Use Hugging Face embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=384,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts,
            tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
            embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        )
    ),
)
Using Ollama Models
Overview
If you want to use Ollama models, you need to pull the model you plan to use, as well as an embedding model such as nomic-embed-text.
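For example (qwen2 here is simply the generation model used later in this section; pull whichever model you plan to use):
bash
ollama pull qwen2
ollama pull nomic-embed-text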
Then you only need to set LightRAG up as follows:
python
from lightrag import LightRAG
from lightrag.llm import ollama_model_complete, ollama_embedding
from lightrag.utils import EmbeddingFunc

# Initialize LightRAG with an Ollama model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,  # Use Ollama model for text generation
    llm_model_name='your_model_name',  # Your model name
    # Use Ollama embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
)
Increasing context size
For LightRAG to work properly, the context should be at least 32k tokens. By default, Ollama models have a context size of 8k. You can achieve this in one of two ways:
1. Increase the num_ctx parameter in the Modelfile:
- Pull the model:
bash
ollama pull qwen2
- Display the Modelfile:
bash
ollama show --modelfile qwen2 > Modelfile
- Edit the Modelfile by adding the following line:
bash
PARAMETER num_ctx 32768
- Create the modified model:
bash
ollama create -f Modelfile qwen2m
2. Set num_ctx via the Ollama API.
You can use the llm_model_kwargs param to configure Ollama:
python
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,  # Use Ollama model for text generation
    llm_model_name='your_model_name',  # Your model name
    llm_model_kwargs={"options": {"num_ctx": 32768}},
    # Use Ollama embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
)
Fully functional example
There is a fully functional example, examples/lightrag_ollama_demo.py, that uses the gemma2:2b model, runs only 4 requests in parallel, and sets the context size to 32k.
Low RAM GPUs
In order to run this experiment on a low-RAM GPU, you should select a small model and tune the context window (increasing the context increases memory consumption). For example, running this Ollama example on a repurposed mining GPU with 6 GB of memory required setting the context size to 26k while using gemma2:2b. It was able to find 197 entities and 19 relations on book.txt.
Query Param
python
from typing import Literal

class QueryParam:
    mode: Literal["local", "global", "hybrid", "naive"] = "global"
    only_need_context: bool = False
    response_type: str = "Multiple Paragraphs"
    # Number of top-k items to retrieve; corresponds to entities in "local" mode and relationships in "global" mode.
    top_k: int = 60
    # Number of tokens for the original chunks.
    max_token_for_text_unit: int = 4000
    # Number of tokens for the relationship descriptions
    max_token_for_global_context: int = 4000
    # Number of tokens for the entity descriptions
    max_token_for_local_context: int = 4000
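For instance, a minimal usage sketch combining these fields (it assumes the rag instance from the Quick Start above; the parameter values are illustrative):
python
from lightrag import QueryParam

# Return only the retrieved context, skip answer generation,
# and shrink the retrieval budget.
param = QueryParam(mode="hybrid", only_need_context=True, top_k=20)
print(rag.query("Who is Scrooge?", param=param))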
Batch Insert
python
# Batch Insert: Insert multiple texts at once
rag.insert(["TEXT1", "TEXT2",...])
Incremental Insert
python
# Incremental Insert: Insert new documents into an existing LightRAG instance
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=embedding_dimension,
        max_token_size=8192,
        func=embedding_func,
    ),
)

with open("./newText.txt") as f:
    rag.insert(f.read())
Multi-file Type Support
textract supports reading file types such as TXT, DOCX, PPTX, CSV, and PDF.
python
import textract
file_path = 'TEXT.pdf'
text_content = textract.process(file_path)
rag.insert(text_content.decode('utf-8'))
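Since rag.insert() also accepts a list (see Batch Insert above), several files of mixed types can be ingested in one call; a sketch with hypothetical file paths:
python
import textract

# Hypothetical paths; any file type textract can read works here.
file_paths = ['report.pdf', 'notes.docx', 'slides.pptx']
texts = [textract.process(p).decode('utf-8') for p in file_paths]
rag.insert(texts)  # batch insert, as shown earlier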
Graph Visualization
Graph visualization with HTML
- The following code can be found in examples/graph_visual_with_html.py.
python
import networkx as nx
from pyvis.network import Network
# Load the GraphML file
G = nx.read_graphml('./dickens/graph_chunk_entity_relation.graphml')
# Create a Pyvis network
net = Network(notebook=True)
# Convert NetworkX graph to Pyvis network
net.from_nx(G)
# Save and display the network
net.show('knowledge_graph.html')
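The snippet above assumes networkx and pyvis are installed; both are available from PyPI:
bash
pip install networkx pyvis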
Graph visualization with Neo4j
- The following code can be found in examples/graph_visual_with_neo4j.py. Note that the Cypher below relies on the Neo4j APOC plugin (apoc.create.addLabels, apoc.create.relationship, apoc.create.setLabels).
python
import os
import json
from lightrag.utils import xml_to_json
from neo4j import GraphDatabase
# Constants
WORKING_DIR = "./dickens"
BATCH_SIZE_NODES = 500
BATCH_SIZE_EDGES = 100
# Neo4j connection credentials
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "your_password"
def convert_xml_to_json(xml_path, output_path):
    """Converts XML file to JSON and saves the output."""
    if not os.path.exists(xml_path):
        print(f"Error: File not found - {xml_path}")
        return None

    json_data = xml_to_json(xml_path)
    if json_data:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2)
        print(f"JSON file created: {output_path}")
        return json_data
    else:
        print("Failed to create JSON data")
        return None

def process_in_batches(tx, query, data, batch_size):
    """Process data in batches and execute the given query."""
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        tx.run(query, {"nodes": batch} if "nodes" in query else {"edges": batch})

def main():
    # Paths
    xml_file = os.path.join(WORKING_DIR, 'graph_chunk_entity_relation.graphml')
    json_file = os.path.join(WORKING_DIR, 'graph_data.json')

    # Convert XML to JSON
    json_data = convert_xml_to_json(xml_file, json_file)
    if json_data is None:
        return

    # Load nodes and edges
    nodes = json_data.get('nodes', [])
    edges = json_data.get('edges', [])

    # Neo4j queries
    create_nodes_query = """
    UNWIND $nodes AS node
    MERGE (e:Entity {id: node.id})
    SET e.entity_type = node.entity_type,
        e.description = node.description,
        e.source_id = node.source_id,
        e.displayName = node.id
    REMOVE e:Entity
    WITH e, node
    CALL apoc.create.addLabels(e, [node.entity_type]) YIELD node AS labeledNode
    RETURN count(*)
    """

    create_edges_query = """
    UNWIND $edges AS edge
    MATCH (source {id: edge.source})
    MATCH (target {id: edge.target})
    WITH source, target, edge,
         CASE
             WHEN edge.keywords CONTAINS 'lead' THEN 'lead'
             WHEN edge.keywords CONTAINS 'participate' THEN 'participate'
             WHEN edge.keywords CONTAINS 'uses' THEN 'uses'
             WHEN edge.keywords CONTAINS 'located' THEN 'located'
             WHEN edge.keywords CONTAINS 'occurs' THEN 'occurs'
             ELSE REPLACE(SPLIT(edge.keywords, ',')[0], '\"', '')
         END AS relType
    CALL apoc.create.relationship(source, relType, {
        weight: edge.weight,
        description: edge.description,
        keywords: edge.keywords,
        source_id: edge.source_id
    }, target) YIELD rel
    RETURN count(*)
    """

    set_displayname_and_labels_query = """
    MATCH (n)
    SET n.displayName = n.id
    WITH n
    CALL apoc.create.setLabels(n, [n.entity_type]) YIELD node
    RETURN count(*)
    """

    # Create a Neo4j driver
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

    try:
        # Execute queries in batches
        with driver.session() as session:
            # Insert nodes in batches
            session.execute_write(process_in_batches, create_nodes_query, nodes, BATCH_SIZE_NODES)

            # Insert edges in batches
            session.execute_write(process_in_batches, create_edges_query, edges, BATCH_SIZE_EDGES)

            # Set displayName and labels
            session.run(set_displayname_and_labels_query)
    except Exception as e:
        print(f"Error occurred: {e}")
    finally:
        driver.close()

if __name__ == "__main__":
    main()
API Server Implementation
LightRAG also provides a FastAPI-based server implementation for RESTful API access to RAG operations. This allows you to run LightRAG as a service and interact with it through HTTP requests.
Setting up the API Server
Setup instructions
1. First, ensure you have the required dependencies:
bash
pip install fastapi uvicorn pydantic
2. Set up your environment variables:
bash
export RAG_DIR="your_index_directory" # Optional: Defaults to "index_default"
3. Run the API server:
bash
python examples/lightrag_api_openai_compatible_demo.py
The server will start on http://0.0.0.0:8020.
API Endpoints
The API server provides the following endpoints:
- Query endpoint
- URL: /query
- Method: POST
- Body:
json
{
    "query": "Your question here",
    "mode": "hybrid" // Can be "naive", "local", "global", or "hybrid"
}
- Example:
bash
curl -X POST "http://127.0.0.1:8020/query" \
-H "Content-Type: application/json" \
-d '{"query": "What are the main themes?", "mode": "hybrid"}'
- Insert text endpoint
- URL: /insert
- Method: POST
- Body:
json
{
    "text": "Your text content here"
}
- Example:
bash
curl -X POST "http://127.0.0.1:8020/insert" \
-H "Content-Type: application/json" \
-d '{"text": "Content to be inserted into RAG"}'
- Insert file endpoint
- URL: /insert_file
- Method: POST
- Body:
json
{
    "file_path": "path/to/your/file.txt"
}
- Example:
bash
curl -X POST "http://127.0.0.1:8020/insert_file" \
-H "Content-Type: application/json" \
-d '{"file_path": "./book.txt"}'
- Health check endpoint
- URL: /health
- Method: GET
- Example:
bash
curl -X GET "http://127.0.0.1:8020/health"
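For programmatic access, the same endpoints can be called from Python as well; a minimal sketch using the requests library (assuming the server above is running on port 8020 and returns JSON bodies):
python
import requests

BASE_URL = "http://127.0.0.1:8020"

# Insert a document, then ask a question about it in hybrid mode.
requests.post(f"{BASE_URL}/insert",
              json={"text": "Content to be inserted into RAG"}).raise_for_status()

response = requests.post(
    f"{BASE_URL}/query",
    json={"query": "What are the main themes?", "mode": "hybrid"},
)
response.raise_for_status()
print(response.json())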
Configuration
The API server can be configured using environment variables:
- RAG_DIR: Directory for storing the RAG index (default: "index_default")
- API keys and base URLs for specific LLM and embedding model providers should be configured in the code
Error Handling
The API includes comprehensive error handling:
- File not found errors (404)
- Processing errors (500)
- Support for multiple file encodings (UTF-8 and GBK)
Evaluation
Dataset
The dataset used in LightRAG can be downloaded from TommyChien/UltraDomain.
Generate Query
LightRAG uses the following prompt to generate high-level queries; the corresponding code is in example/generate_query.py.
Prompt
Batch Eval
To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt; the specific code is available in example/batch_eval.py.
Prompt
Overall Performance Table
| | Agriculture | | CS | | Legal | | Mix | |
|---|---|---|---|---|---|---|---|---|
| | NaiveRAG | LightRAG | NaiveRAG | LightRAG | NaiveRAG | LightRAG | NaiveRAG | LightRAG |
| Comprehensiveness | 32.69% | 67.31% | 35.44% | 64.56% | 19.05% | 80.95% | 36.36% | 63.64% |
| Diversity | 24.09% | 75.91% | 35.24% | 64.76% | 10.98% | 89.02% | 30.76% | 69.24% |
| Empowerment | 31.35% | 68.65% | 35.48% | 64.52% | 17.59% | 82.41% | 40.95% | 59.05% |
| Overall | 33.30% | 66.70% | 34.76% | 65.24% | 17.46% | 82.54% | 37.59% | 62.40% |
| | RQ-RAG | LightRAG | RQ-RAG | LightRAG | RQ-RAG | LightRAG | RQ-RAG | LightRAG |
| Comprehensiveness | 32.05% | 67.95% | 39.30% | 60.70% | 18.57% | 81.43% | 38.89% | 61.11% |
| Diversity | 29.44% | 70.56% | 38.71% | 61.29% | 15.14% | 84.86% | 28.50% | 71.50% |
| Empowerment | 32.51% | 67.49% | 37.52% | 62.48% | 17.80% | 82.20% | 43.96% | 56.04% |
| Overall | 33.29% | 66.71% | 39.03% | 60.97% | 17.80% | 82.20% | 39.61% | 60.39% |
| | HyDE | LightRAG | HyDE | LightRAG | HyDE | LightRAG | HyDE | LightRAG |
| Comprehensiveness | 24.39% | 75.61% | 36.49% | 63.51% | 27.68% | 72.32% | 42.17% | 57.83% |
| Diversity | 24.96% | 75.34% | 37.41% | 62.59% | 18.79% | 81.21% | 30.88% | 69.12% |
| Empowerment | 24.89% | 75.11% | 34.99% | 65.01% | 26.99% | 73.01% | 45.61% | 54.39% |
| Overall | 23.17% | 76.83% | 35.67% | 64.33% | 27.68% | 72.32% | 42.72% | 57.28% |
| | GraphRAG | LightRAG | GraphRAG | LightRAG | GraphRAG | LightRAG | GraphRAG | LightRAG |
| Comprehensiveness | 45.56% | 54.44% | 45.98% | 54.02% | 47.13% | 52.87% | 51.86% | 48.14% |
| Diversity | 19.65% | 80.35% | 39.64% | 60.36% | 25.55% | 74.45% | 35.87% | 64.13% |
| Empowerment | 36.69% | 63.31% | 45.09% | 54.91% | 42.81% | 57.19% | 52.94% | 47.06% |
| Overall | 43.62% | 56.38% | 45.98% | 54.02% | 45.70% | 54.30% | 51.86% | 48.14% |
- The table above compares LightRAG with NaiveRAG, RQ-RAG, HyDE, and GraphRAG on four datasets (Agriculture, CS, Legal, and Mix). LightRAG clearly beats NaiveRAG and RQ-RAG on comprehensiveness, diversity, empowerment, and overall quality across all four datasets. Against HyDE it also wins on every measure, although its narrowest margin is on empowerment for the Mix dataset (54.39% vs. 45.61%). Against GraphRAG it loses only on the Mix dataset, in comprehensiveness, empowerment, and overall quality, and wins everywhere else.
Reproduce
All the code can be found in the ./reproduce directory.
Step 0: Extract Unique Contexts
First, we need to extract unique contexts from the datasets.
Code
python
import glob
import json
import os

def extract_unique_contexts(input_directory, output_directory):
    os.makedirs(output_directory, exist_ok=True)

    jsonl_files = glob.glob(os.path.join(input_directory, '*.jsonl'))
    print(f"Found {len(jsonl_files)} JSONL files.")

    for file_path in jsonl_files:
        filename = os.path.basename(file_path)
        name, ext = os.path.splitext(filename)
        output_filename = f"{name}_unique_contexts.json"
        output_path = os.path.join(output_directory, output_filename)

        unique_contexts_dict = {}

        print(f"Processing file: {filename}")

        try:
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line_number, line in enumerate(infile, start=1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json_obj = json.loads(line)
                        context = json_obj.get('context')
                        if context and context not in unique_contexts_dict:
                            unique_contexts_dict[context] = None
                    except json.JSONDecodeError as e:
                        print(f"JSON decoding error in file {filename} at line {line_number}: {e}")
        except FileNotFoundError:
            print(f"File not found: {filename}")
            continue
        except Exception as e:
            print(f"An error occurred while processing file {filename}: {e}")
            continue

        unique_contexts_list = list(unique_contexts_dict.keys())
        print(f"There are {len(unique_contexts_list)} unique `context` entries in the file {filename}.")

        try:
            with open(output_path, 'w', encoding='utf-8') as outfile:
                json.dump(unique_contexts_list, outfile, ensure_ascii=False, indent=4)
            print(f"Unique `context` entries have been saved to: {output_filename}")
        except Exception as e:
            print(f"An error occurred while saving to the file {output_filename}: {e}")

    print("All files have been processed.")
Step 1: Insert Contexts
For the extracted contexts, we insert them into the LightRAG system.
Code
python
import json
import time

def insert_text(rag, file_path):
    with open(file_path, mode='r') as f:
        unique_contexts = json.load(f)

    retries = 0
    max_retries = 3
    while retries < max_retries:
        try:
            rag.insert(unique_contexts)
            break
        except Exception as e:
            retries += 1
            print(f"Insertion failed, retrying ({retries}/{max_retries}), error: {e}")
            time.sleep(10)
    if retries == max_retries:
        print("Insertion failed after exceeding the maximum number of retries")
Step 2: Generate Queries
We extract tokens from the first half and the second half of each context in the dataset, then combine them as a dataset description to generate queries.
Code
python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def get_summary(context, tot_tokens=2000):
    tokens = tokenizer.tokenize(context)
    half_tokens = tot_tokens // 2

    start_tokens = tokens[1000:1000 + half_tokens]
    end_tokens = tokens[-(1000 + half_tokens):-1000]

    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)

    return summary
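A possible glue step for the combination described above (a sketch, not code from the repo: unique_contexts is assumed to come from Step 0, and the sample size is arbitrary):
python
# Hypothetical combination step: summarize a few contexts and join them
# into a single dataset description used for query generation.
descriptions = [get_summary(ctx) for ctx in unique_contexts[:5]]
dataset_description = "\n\n".join(descriptions)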
Step 3: Query
For the queries generated in Step 2, we extract them and query LightRAG.
Code
python
import re

def extract_queries(file_path):
    with open(file_path, 'r') as f:
        data = f.read()

    data = data.replace('**', '')

    queries = re.findall(r'- Question \d+: (.+)', data)

    return queries
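A sketch of the query loop itself (the queries file path is hypothetical, and rag is assumed to be the instance built in Step 1):
python
from lightrag import QueryParam

queries = extract_queries("./queries.txt")  # hypothetical path
for query in queries:
    print(rag.query(query, param=QueryParam(mode="hybrid")))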
Code Structure
bash
.
├── examples
│ ├── batch_eval.py
│ ├── generate_query.py
│ ├── graph_visual_with_html.py
│ ├── graph_visual_with_neo4j.py
│ ├── lightrag_api_openai_compatible_demo.py
│ ├── lightrag_azure_openai_demo.py
│ ├── lightrag_bedrock_demo.py
│ ├── lightrag_hf_demo.py
│ ├── lightrag_lmdeploy_demo.py
│ ├── lightrag_ollama_demo.py
│ ├── lightrag_openai_compatible_demo.py
│ ├── lightrag_openai_demo.py
│ ├── lightrag_siliconcloud_demo.py
│ └── vram_management_demo.py
├── lightrag
│ ├── __init__.py
│ ├── base.py
│ ├── lightrag.py
│ ├── llm.py
│ ├── operate.py
│ ├── prompt.py
│ ├── storage.py
│ └── utils.py
├── reproduce
│ ├── Step_0.py
│ ├── Step_1_openai_compatible.py
│ ├── Step_1.py
│ ├── Step_2.py
│ ├── Step_3_openai_compatible.py
│ └── Step_3.py
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── requirements.txt
└── setup.py
Star History
Citation
bibtex
@article{guo2024lightrag,
  title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
  author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
  year={2024},
  eprint={2410.05779},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
GitHub repository: https://github.com/HKUDS/LightRAG/