LightRAG, More Powerful than Microsoft's GraphRAG: Simple and Fast Retrieval-Augmented Generation

# 🚀 LightRAG: Simple and Fast Retrieval-Augmented Generation

This repository hosts the code for LightRAG. The code structure is based on nano-graphrag.

## 🎉 News

* [2024.10.29] 🎯📢 LightRAG now supports multiple file types, including PDF, DOC, PPT, and CSV, via `textract`.
* [2024.10.20] 🎯📢 We added a new feature to LightRAG: graph visualization.
* [2024.10.18] 🎯📢 We added a link to a LightRAG introduction video. Thanks to the author!
* [2024.10.17] 🎯📢 We created a Discord channel! Welcome to join for sharing and discussion! 🎉🎉
* [2024.10.16] 🎯📢 LightRAG now supports Ollama models!
* [2024.10.15] 🎯📢 LightRAG now supports Hugging Face models!

## Algorithm Flowchart

![Algorithm flowchart](https://i-blog.csdnimg.cn/direct/1fa7f38095e74a1a8a61225d86f7e10d.png)

## Install

* Install from source (recommended):

```bash
cd LightRAG
pip install -e .
```

* Install from PyPI:

```bash
pip install lightrag-hku
```

## Quick Start

* A video demo of running LightRAG locally is available.
* All the code can be found in the `examples` directory.
* If you use an OpenAI model, set your OpenAI API key in the environment: `export OPENAI_API_KEY="sk-..."`.
* Download the demo text "A Christmas Carol by Charles Dickens":

```bash
curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt
```

Use the following Python snippet (in a script) to initialize LightRAG and perform queries:

```python
import os
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete, gpt_4o_complete

#########
# Uncomment the below two lines if running in a jupyter notebook to handle the async nature of rag.insert()
# import nest_asyncio
# nest_asyncio.apply()
#########

WORKING_DIR = "./dickens"

if not os.path.exists(WORKING_DIR):
    os.mkdir(WORKING_DIR)

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=gpt_4o_mini_complete  # Use gpt_4o_mini_complete LLM model
    # llm_model_func=gpt_4o_complete  # Optionally, use a stronger model
)

with open("./book.txt") as f:
    rag.insert(f.read())

# Perform naive search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="naive")))

# Perform local search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="local")))

# Perform global search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="global")))

# Perform hybrid search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid")))
```

### Using an OpenAI-like API

* LightRAG also supports OpenAI-like chat/embedding APIs:

```python
import os

import numpy as np

from lightrag import LightRAG
from lightrag.llm import openai_complete_if_cache, openai_embedding
from lightrag.utils import EmbeddingFunc


async def llm_model_func(
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    return await openai_complete_if_cache(
        "solar-mini",
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        api_key=os.getenv("UPSTAGE_API_KEY"),
        base_url="https://api.upstage.ai/v1/solar",
        **kwargs
    )

async def embedding_func(texts: list[str]) -> np.ndarray:
    return await openai_embedding(
        texts,
        model="solar-embedding-1-large-query",
        api_key=os.getenv("UPSTAGE_API_KEY"),
        base_url="https://api.upstage.ai/v1/solar"
    )

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=4096,
        max_token_size=8192,
        func=embedding_func
    )
)
```

### Using Hugging Face Models

* To use Hugging Face models, set up LightRAG as follows:

```python
from lightrag.llm import hf_model_complete, hf_embedding
from lightrag.utils import EmbeddingFunc
from transformers import AutoModel, AutoTokenizer

# Initialize LightRAG with Hugging Face model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=hf_model_complete,  # Use Hugging Face model for text generation
    llm_model_name='meta-llama/Llama-3.1-8B-Instruct',  # Model name from Hugging Face
    # Use Hugging Face embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=384,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts,
            tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
            embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        )
    ),
)
```

### Using Ollama Models

#### Overview

To use Ollama models, you need to pull the model you plan to use as well as an embedding model, e.g. `nomic-embed-text`.

Then set up LightRAG as follows:

```python
from lightrag.llm import ollama_model_complete, ollama_embedding
from lightrag.utils import EmbeddingFunc

# Initialize LightRAG with Ollama model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,  # Use Ollama model for text generation
    llm_model_name='your_model_name',  # Your model name
    # Use Ollama embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
)
```
#### Increasing the Context Size

For LightRAG to work properly, the context should be at least 32k tokens. By default, Ollama models have a context size of 8k. You can increase it in one of two ways:

##### Increase `num_ctx` in the Modelfile

* Pull the model:

```bash
ollama pull qwen2
```

* Display the Modelfile:

```bash
ollama show --modelfile qwen2 > Modelfile
```

* Edit the Modelfile and add the following line:

```bash
PARAMETER num_ctx 32768
```

* Create the modified model:

```bash
ollama create -f Modelfile qwen2m
```

##### Set `num_ctx` via the Ollama API

You can use the `llm_model_kwargs` parameter to configure Ollama:

```python
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,  # Use Ollama model for text generation
    llm_model_name='your_model_name',  # Your model name
    llm_model_kwargs={"options": {"num_ctx": 32768}},
    # Use Ollama embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
)
```

#### Fully Functional Example

There is a fully functional example in `examples/lightrag_ollama_demo.py` that uses the `gemma2:2b` model, runs only 4 requests in parallel, and sets the context size to 32k.

#### Low-RAM GPUs

To run this experiment on a low-RAM GPU, choose a small model and tune the context window (a larger context increases memory consumption). For example, running this Ollama example on a repurposed mining GPU with 6 GB of memory and `gemma2:2b` required setting the context size to 26k. With that setup it was able to find 197 entities and 19 relationships in `book.txt`.

### Query Parameters

```python
from typing import Literal


class QueryParam:
    mode: Literal["local", "global", "hybrid", "naive"] = "global"
    only_need_context: bool = False
    response_type: str = "Multiple Paragraphs"
    # Number of top-k items to retrieve; corresponds to entities in "local" mode and relationships in "global" mode.
    top_k: int = 60
    # Number of tokens for the original chunks.
    max_token_for_text_unit: int = 4000
    # Number of tokens for the relationship descriptions
    max_token_for_global_context: int = 4000
    # Number of tokens for the entity descriptions
    max_token_for_local_context: int = 4000
```
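As a minimal usage sketch for these parameters (the question text is illustrative, and `rag` is the instance from the Quick Start), you can tighten retrieval and ask for only the retrieved context instead of a generated answer:

```python
from lightrag import QueryParam

# A minimal sketch: tighter retrieval and context-only output.
param = QueryParam(
    mode="hybrid",           # one of "naive", "local", "global", "hybrid"
    top_k=20,                # retrieve fewer items than the default 60
    only_need_context=True,  # return the retrieved context, not an LLM answer
)

context = rag.query("What are the top themes in this story?", param=param)
print(context)
```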
### Batch Insert

```python
# Batch Insert: Insert multiple texts at once
rag.insert(["TEXT1", "TEXT2", ...])
```

### Incremental Insert

```python
# Incremental Insert: Insert new documents into an existing LightRAG instance
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=embedding_dimension,
        max_token_size=8192,
        func=embedding_func,
    ),
)

with open("./newText.txt") as f:
    rag.insert(f.read())
```

### Multi-File Type Support

`textract` is supported for reading file types such as TXT, DOCX, PPTX, CSV, and PDF:

```python
import textract

file_path = 'TEXT.pdf'
text_content = textract.process(file_path)

rag.insert(text_content.decode('utf-8'))
```

### Graph Visualization

#### Graph Visualization with HTML

* The following code can be found in `examples/graph_visual_with_html.py`:

```python
import networkx as nx
from pyvis.network import Network

# Load the GraphML file
G = nx.read_graphml('./dickens/graph_chunk_entity_relation.graphml')

# Create a Pyvis network
net = Network(notebook=True)

# Convert NetworkX graph to Pyvis network
net.from_nx(G)

# Save and display the network
net.show('knowledge_graph.html')
```

#### Graph Visualization with Neo4j

* The following code can be found in `examples/graph_visual_with_neo4j.py`. Note that the Cypher queries below call `apoc.create.*` procedures, so the Neo4j instance needs the APOC plugin installed.

```python
import os
import json
from lightrag.utils import xml_to_json
from neo4j import GraphDatabase

# Constants
WORKING_DIR = "./dickens"
BATCH_SIZE_NODES = 500
BATCH_SIZE_EDGES = 100

# Neo4j connection credentials
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "your_password"

def convert_xml_to_json(xml_path, output_path):
    """Converts XML file to JSON and saves the output."""
    if not os.path.exists(xml_path):
        print(f"Error: File not found - {xml_path}")
        return None

    json_data = xml_to_json(xml_path)
    if json_data:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2)
        print(f"JSON file created: {output_path}")
        return json_data
    else:
        print("Failed to create JSON data")
        return None

def process_in_batches(tx, query, data, batch_size):
    """Process data in batches and execute the given query."""
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        tx.run(query, {"nodes": batch} if "nodes" in query else {"edges": batch})

def main():
    # Paths
    xml_file = os.path.join(WORKING_DIR, 'graph_chunk_entity_relation.graphml')
    json_file = os.path.join(WORKING_DIR, 'graph_data.json')

    # Convert XML to JSON
    json_data = convert_xml_to_json(xml_file, json_file)
    if json_data is None:
        return

    # Load nodes and edges
    nodes = json_data.get('nodes', [])
    edges = json_data.get('edges', [])

    # Neo4j queries
    create_nodes_query = """
    UNWIND $nodes AS node
    MERGE (e:Entity {id: node.id})
    SET e.entity_type = node.entity_type,
        e.description = node.description,
        e.source_id = node.source_id,
        e.displayName = node.id
    REMOVE e:Entity
    WITH e, node
    CALL apoc.create.addLabels(e, [node.entity_type]) YIELD node AS labeledNode
    RETURN count(*)
    """

    create_edges_query = """
    UNWIND $edges AS edge
    MATCH (source {id: edge.source})
    MATCH (target {id: edge.target})
    WITH source, target, edge,
         CASE
            WHEN edge.keywords CONTAINS 'lead' THEN 'lead'
            WHEN edge.keywords CONTAINS 'participate' THEN 'participate'
            WHEN edge.keywords CONTAINS 'uses' THEN 'uses'
            WHEN edge.keywords CONTAINS 'located' THEN 'located'
            WHEN edge.keywords CONTAINS 'occurs' THEN 'occurs'
            ELSE REPLACE(SPLIT(edge.keywords, ',')[0], '\"', '')
         END AS relType
    CALL apoc.create.relationship(source, relType, {
        weight: edge.weight,
        description: edge.description,
        keywords: edge.keywords,
        source_id: edge.source_id
    }, target) YIELD rel
    RETURN count(*)
    """

    set_displayname_and_labels_query = """
    MATCH (n)
    SET n.displayName = n.id
    WITH n
    CALL apoc.create.setLabels(n, [n.entity_type]) YIELD node
    RETURN count(*)
    """

    # Create a Neo4j driver
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

    try:
        # Execute queries in batches
        with driver.session() as session:
            # Insert nodes in batches
            session.execute_write(process_in_batches, create_nodes_query, nodes, BATCH_SIZE_NODES)

            # Insert edges in batches
            session.execute_write(process_in_batches, create_edges_query, edges, BATCH_SIZE_EDGES)

            # Set displayName and labels
            session.run(set_displayname_and_labels_query)

    except Exception as e:
        print(f"Error occurred: {e}")

    finally:
        driver.close()

if __name__ == "__main__":
    main()
```
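Once the script has run, a quick sanity check can confirm the import. This is a sketch that assumes the same connection settings as above:

```python
from neo4j import GraphDatabase

# A minimal sanity-check sketch, assuming the connection settings used above.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your_password"))

with driver.session() as session:
    node_count = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
    edge_count = session.run("MATCH ()-[r]->() RETURN count(r) AS c").single()["c"]
    print(f"Imported {node_count} nodes and {edge_count} relationships")

driver.close()
```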
## API Server Implementation

LightRAG also provides a FastAPI-based server implementation for accessing RAG operations through a RESTful API. This lets you run LightRAG as a service and interact with it via HTTP requests.

### Setting Up the API Server

1. First, make sure you have the required dependencies:

```bash
pip install fastapi uvicorn pydantic
```

2. Set the environment variables:

```bash
export RAG_DIR="your_index_directory"  # Optional: Defaults to "index_default"
```

3. Run the API server:

```bash
python examples/lightrag_api_openai_compatible_demo.py
```

The server will start at `http://0.0.0.0:8020`.

### API Endpoints

The API server provides the following endpoints:

1. Query endpoint

* **URL**: /query
* **Method**: POST
* **Body**:

```json
{
    "query": "Your question here",
    "mode": "hybrid"  // Can be "naive", "local", "global", or "hybrid"
}
```

* **Example**:

```bash
curl -X POST "http://127.0.0.1:8020/query" \
     -H "Content-Type: application/json" \
     -d '{"query": "What are the main themes?", "mode": "hybrid"}'
```

2. Insert text endpoint

* **URL**: /insert
* **Method**: POST
* **Body**:

```json
{
    "text": "Your text content here"
}
```

* **Example**:

```bash
curl -X POST "http://127.0.0.1:8020/insert" \
     -H "Content-Type: application/json" \
     -d '{"text": "Content to be inserted into RAG"}'
```

3. Insert file endpoint

* **URL**: /insert_file
* **Method**: POST
* **Body**:

```json
{
    "file_path": "path/to/your/file.txt"
}
```

* **Example**:

```bash
curl -X POST "http://127.0.0.1:8020/insert_file" \
     -H "Content-Type: application/json" \
     -d '{"file_path": "./book.txt"}'
```

4. Health check endpoint

* **URL**: /health
* **Method**: GET
* **Example**:

```bash
curl -X GET "http://127.0.0.1:8020/health"
```

### Configuration

The API server can be configured using environment variables:

* `RAG_DIR`: directory for storing the RAG index (default: "index_default")
* API keys and base URLs for specific LLM and embedding model providers should be configured in the code

### Error Handling

The API includes comprehensive error handling:

* File-not-found errors (404)
* Processing errors (500)
* Support for multiple file encodings (UTF-8 and GBK)
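The UTF-8/GBK support mentioned above amounts to a decode fallback when reading uploaded files. A minimal sketch of such a helper (hypothetical; the actual implementation lives in `examples/lightrag_api_openai_compatible_demo.py`) might look like this:

```python
def read_text_with_fallback(file_path: str) -> str:
    """Hypothetical helper: try UTF-8 first, then fall back to GBK."""
    for encoding in ("utf-8", "gbk"):
        try:
            with open(file_path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"Unsupported encoding for file: {file_path}")
```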
## Evaluation

### Datasets

The datasets used in LightRAG can be downloaded from TommyChien/UltraDomain.

### Generating Queries

LightRAG uses a prompt to generate high-level queries; the prompt itself is omitted here, and the corresponding code is in `example/generate_query.py`.

### Batch Evaluation

To evaluate the performance of two RAG systems on high-level queries, LightRAG uses an evaluation prompt; the prompt itself is omitted here, and the code can be found in `example/batch_eval.py`.

### Overall Performance Table

|  | **Agriculture** |  | **CS** |  | **Legal** |  | **Mix** |  |
|---------|-----------------|--------------|----------|--------------|-----------|--------------|------------|--------------|
|  | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** | NaiveRAG | **LightRAG** |
| **Comprehensiveness** | 32.69% | **67.31%** | 35.44% | **64.56%** | 19.05% | **80.95%** | 36.36% | **63.64%** |
| **Diversity** | 24.09% | **75.91%** | 35.24% | **64.76%** | 10.98% | **89.02%** | 30.76% | **69.24%** |
| **Empowerment** | 31.35% | **68.65%** | 35.48% | **64.52%** | 17.59% | **82.41%** | 40.95% | **59.05%** |
| **Overall** | 33.30% | **66.70%** | 34.76% | **65.24%** | 17.46% | **82.54%** | 37.59% | **62.40%** |
|  | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** | RQ-RAG | **LightRAG** |
| **Comprehensiveness** | 32.05% | **67.95%** | 39.30% | **60.70%** | 18.57% | **81.43%** | 38.89% | **61.11%** |
| **Diversity** | 29.44% | **70.56%** | 38.71% | **61.29%** | 15.14% | **84.86%** | 28.50% | **71.50%** |
| **Empowerment** | 32.51% | **67.49%** | 37.52% | **62.48%** | 17.80% | **82.20%** | 43.96% | **56.04%** |
| **Overall** | 33.29% | **66.71%** | 39.03% | **60.97%** | 17.80% | **82.20%** | 39.61% | **60.39%** |
|  | HyDE | **LightRAG** | HyDE | **LightRAG** | HyDE | **LightRAG** | HyDE | **LightRAG** |
| **Comprehensiveness** | 24.39% | **75.61%** | 36.49% | **63.51%** | 27.68% | **72.32%** | 42.17% | **57.83%** |
| **Diversity** | 24.96% | **75.34%** | 37.41% | **62.59%** | 18.79% | **81.21%** | 30.88% | **69.12%** |
| **Empowerment** | 24.89% | **75.11%** | 34.99% | **65.01%** | 26.99% | **73.01%** | 45.61% | **54.39%** |
| **Overall** | 23.17% | **76.83%** | 35.67% | **64.33%** | 27.68% | **72.32%** | 42.72% | **57.28%** |
|  | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** | GraphRAG | **LightRAG** |
| **Comprehensiveness** | 45.56% | **54.44%** | 45.98% | **54.02%** | 47.13% | **52.87%** | **51.86%** | 48.14% |
| **Diversity** | 19.65% | **80.35%** | 39.64% | **60.36%** | 25.55% | **74.45%** | 35.87% | **64.13%** |
| **Empowerment** | 36.69% | **63.31%** | 45.09% | **54.91%** | 42.81% | **57.19%** | **52.94%** | 47.06% |
| **Overall** | 43.62% | **56.38%** | 45.98% | **54.02%** | 45.70% | **54.30%** | **51.86%** | 48.14% |

* The figures above compare **LightRAG** against `NaiveRAG`, `RQ-RAG`, `HyDE`, and `GraphRAG` on four datasets (Agriculture, CS, Legal, and Mix). Based on these results, **LightRAG** decisively beats `NaiveRAG` and `RQ-RAG` on comprehensiveness, diversity, empowerment, and overall score across all four datasets. Against `HyDE`, it loses only on empowerment on the Mix dataset and wins on everything else. Against `GraphRAG`, it likewise falls behind only on the Mix dataset, on comprehensiveness, empowerment, and overall score, and wins on all other measures.

## Reproduce

All the code can be found in the `./reproduce` directory.

### Step 0: Extract Unique Contexts

First, we extract the unique contexts from the datasets.

Code:

```python
import glob
import json
import os


def extract_unique_contexts(input_directory, output_directory):

    os.makedirs(output_directory, exist_ok=True)

    jsonl_files = glob.glob(os.path.join(input_directory, '*.jsonl'))
    print(f"Found {len(jsonl_files)} JSONL files.")

    for file_path in jsonl_files:
        filename = os.path.basename(file_path)
        name, ext = os.path.splitext(filename)
        output_filename = f"{name}_unique_contexts.json"
        output_path = os.path.join(output_directory, output_filename)

        unique_contexts_dict = {}

        print(f"Processing file: {filename}")

        try:
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line_number, line in enumerate(infile, start=1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json_obj = json.loads(line)
                        context = json_obj.get('context')
                        if context and context not in unique_contexts_dict:
                            unique_contexts_dict[context] = None
                    except json.JSONDecodeError as e:
                        print(f"JSON decoding error in file {filename} at line {line_number}: {e}")
        except FileNotFoundError:
            print(f"File not found: {filename}")
            continue
        except Exception as e:
            print(f"An error occurred while processing file {filename}: {e}")
            continue

        unique_contexts_list = list(unique_contexts_dict.keys())
        print(f"There are {len(unique_contexts_list)} unique `context` entries in the file {filename}.")

        try:
            with open(output_path, 'w', encoding='utf-8') as outfile:
                json.dump(unique_contexts_list, outfile, ensure_ascii=False, indent=4)
            print(f"Unique `context` entries have been saved to: {output_filename}")
        except Exception as e:
            print(f"An error occurred while saving to the file {output_filename}: {e}")

    print("All files have been processed.")
```
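A minimal usage sketch of the function above; the directory names are hypothetical placeholders, since the layout depends on where you put the UltraDomain files:

```python
# A minimal usage sketch; the directory names are hypothetical placeholders.
extract_unique_contexts(
    input_directory="./datasets",                   # directory containing the *.jsonl dataset files
    output_directory="./datasets/unique_contexts",  # where the *_unique_contexts.json files are written
)
```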
### Step 1: Insert Contexts

We insert the extracted contexts into the LightRAG system.

Code:

```python
import json
import time


def insert_text(rag, file_path):
    with open(file_path, mode='r') as f:
        unique_contexts = json.load(f)

    retries = 0
    max_retries = 3
    while retries < max_retries:
        try:
            rag.insert(unique_contexts)
            break
        except Exception as e:
            retries += 1
            print(f"Insertion failed, retrying ({retries}/{max_retries}), error: {e}")
            time.sleep(10)
    if retries == max_retries:
        print("Insertion failed after exceeding the maximum number of retries")
```

### Step 2: Generate Queries

We extract tokens from the first half and the second half of each context in the dataset, then combine them as a dataset description to generate queries.

Code:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def get_summary(context, tot_tokens=2000):
    tokens = tokenizer.tokenize(context)
    half_tokens = tot_tokens // 2

    start_tokens = tokens[1000:1000 + half_tokens]
    end_tokens = tokens[-(1000 + half_tokens):-1000]

    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)

    return summary
```

### Step 3: Query

We extract the queries generated in Step 2 and use them to query LightRAG.

Code:

```python
import re


def extract_queries(file_path):
    with open(file_path, 'r') as f:
        data = f.read()

    data = data.replace('**', '')

    queries = re.findall(r'- Question \d+: (.+)', data)

    return queries
```
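Putting Step 3 together, a sketch of the query loop might look like this; the queries file name and the query mode are assumptions, and `rag` is an initialized LightRAG instance as in the Quick Start:

```python
from lightrag import QueryParam

# A sketch of the Step 3 query loop; the file name and mode are assumptions.
queries = extract_queries("./queries.txt")

for query in queries:
    result = rag.query(query, param=QueryParam(mode="hybrid"))
    print(result)
```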
## Code Structure

```text
.
├── examples
│   ├── batch_eval.py
│   ├── generate_query.py
│   ├── graph_visual_with_html.py
│   ├── graph_visual_with_neo4j.py
│   ├── lightrag_api_openai_compatible_demo.py
│   ├── lightrag_azure_openai_demo.py
│   ├── lightrag_bedrock_demo.py
│   ├── lightrag_hf_demo.py
│   ├── lightrag_lmdeploy_demo.py
│   ├── lightrag_ollama_demo.py
│   ├── lightrag_openai_compatible_demo.py
│   ├── lightrag_openai_demo.py
│   ├── lightrag_siliconcloud_demo.py
│   └── vram_management_demo.py
├── lightrag
│   ├── __init__.py
│   ├── base.py
│   ├── lightrag.py
│   ├── llm.py
│   ├── operate.py
│   ├── prompt.py
│   ├── storage.py
│   └── utils.py
├── reproduce
│   ├── Step_0.py
│   ├── Step_1_openai_compatible.py
│   ├── Step_1.py
│   ├── Step_2.py
│   ├── Step_3_openai_compatible.py
│   └── Step_3.py
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── requirements.txt
└── setup.py
```

## Star History

![Star history chart](https://i-blog.csdnimg.cn/direct/954af3c3611346e799239202e7069151.png)

## Citation

```bibtex
@article{guo2024lightrag,
  title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
  author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
  year={2024},
  eprint={2410.05779},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
```

GitHub repository
