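LightRAG, More Powerful than Microsoft's GraphRAG: Simple and Fast Retrieval-Augmented Generation

🚀 LightRAG: Simple and Fast Retrieval-Augmented Generation

This repository hosts the code for LightRAG. The code structure is based on nano-graphrag.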

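🎉 News

[2024.10.29]🎯📢LightRAG now supports multiple file types, including PDF, DOC, PPT, and CSV, via textract.

[2024.10.20]🎯📢We added a new feature to LightRAG: graph visualization.

[2024.10.18]🎯📢We added a link to a LightRAG introduction video. Thanks to the author!

[2024.10.17]🎯📢We created a Discord channel! You are welcome to join for sharing and discussion! 🎉🎉

[2024.10.16]🎯📢LightRAG now supports Ollama models!

[2024.10.15]🎯📢LightRAG now supports Hugging Face models!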

Algorithm Flowchart

Install

  • Install from source (recommended)
bash
cd LightRAG
pip install -e .
  • Install from PyPI
bash
pip install lightrag-hku

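Quick Start

  • A video demo of running LightRAG locally.
  • All the code can be found in the examples directory.
  • If you use OpenAI models, set your OpenAI API key in the environment: export OPENAI_API_KEY="sk-...".
  • Download the demo text "A Christmas Carol" by Charles Dickens:
bash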
curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt

Use the following Python snippet (in a script) to initialize LightRAG and perform queries:

python
import os
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete, gpt_4o_complete

#########
# Uncomment the below two lines if running in a jupyter notebook to handle the async nature of rag.insert()
# import nest_asyncio
# nest_asyncio.apply()
#########

WORKING_DIR = "./dickens"


if not os.path.exists(WORKING_DIR):
    os.mkdir(WORKING_DIR)

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=gpt_4o_mini_complete  # Use gpt_4o_mini_complete LLM model
    # llm_model_func=gpt_4o_complete  # Optionally, use a stronger model
)

with open("./book.txt") as f:
    rag.insert(f.read())


# Perform naive search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="naive")))

# Perform local search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="local")))

# Perform global search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="global")))


# Perform hybrid search
print(rag.query("What are the top themes in this story?", param=QueryParam(mode="hybrid")))

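Using OpenAI-like APIs

  • LightRAG also supports OpenAI-like chat/embeddings APIs:
python
import os

import numpy as np
from lightrag import LightRAG
from lightrag.llm import openai_complete_if_cache, openai_embedding
from lightrag.utils import EmbeddingFunc

async def llm_model_func(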
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    return await openai_complete_if_cache(
        "solar-mini",
        prompt,
        system_prompt=system_prompt,
        history_messages=history_messages,
        api_key=os.getenv("UPSTAGE_API_KEY"),
        base_url="https://api.upstage.ai/v1/solar",
        **kwargs
    )

async def embedding_func(texts: list[str]) -> np.ndarray:
    return await openai_embedding(
        texts,
        model="solar-embedding-1-large-query",
        api_key=os.getenv("UPSTAGE_API_KEY"),
        base_url="https://api.upstage.ai/v1/solar"
    )

rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=4096,
        max_token_size=8192,
        func=embedding_func
    )
)

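Using Hugging Face Models

  • If you want to use Hugging Face models, you only need to set up LightRAG as follows:
python
from lightrag import LightRAG
from lightrag.llm import hf_model_complete, hf_embedding
from lightrag.utils import EmbeddingFunc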
from transformers import AutoModel, AutoTokenizer

# Initialize LightRAG with Hugging Face model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=hf_model_complete,  # Use Hugging Face model for text generation
    llm_model_name='meta-llama/Llama-3.1-8B-Instruct',  # Model name from Hugging Face
    # Use Hugging Face embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=384,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts,
            tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
            embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        )
    ),
)

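Using Ollama Models

Overview

If you want to use Ollama models, you need to pull the model you plan to use as well as an embedding model, for example nomic-embed-text.

Then you only need to set up LightRAG as follows:

python
from lightrag import LightRAG
from lightrag.llm import ollama_model_complete, ollama_embedding
from lightrag.utils import EmbeddingFunc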

# Initialize LightRAG with Ollama model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,  # Use Ollama model for text generation
    llm_model_name='your_model_name', # Your model name
    # Use Ollama embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
)

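Increasing Context Size

For LightRAG to work properly, the context should be at least 32k tokens. By default, Ollama models have a context size of 8k. You can raise it in one of two ways:

Increasing the num_ctx parameter in the Modelfile.
  • Pull the model:
bash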
ollama pull qwen2
  • Display the model file:
bash
ollama show --modelfile qwen2 > Modelfile
  • Edit the model file by adding the following line:
bash
PARAMETER num_ctx 32768
  • Create the modified model:
bash
ollama create -f Modelfile qwen2m
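
The edit step above can also be scripted. The following is a minimal sketch, not part of LightRAG or Ollama: `set_num_ctx` is a hypothetical helper that takes the text of a Modelfile and returns it with `PARAMETER num_ctx` set, replacing an existing setting if one is present. The sample Modelfile text is illustrative only.

```python
import re

# Matches an existing "PARAMETER num_ctx <number>" line (one line at a time).
_NUM_CTX_LINE = re.compile(
    r"^PARAMETER[ \t]+num_ctx[ \t]+\d+[ \t]*$", re.MULTILINE | re.IGNORECASE
)

def set_num_ctx(modelfile_text: str, num_ctx: int = 32768) -> str:
    """Return the Modelfile text with `PARAMETER num_ctx` set to `num_ctx`."""
    line = f"PARAMETER num_ctx {num_ctx}"
    if _NUM_CTX_LINE.search(modelfile_text):
        # Replace the existing setting instead of adding a duplicate.
        return _NUM_CTX_LINE.sub(line, modelfile_text)
    # Otherwise append the setting as a new final line.
    return modelfile_text.rstrip("\n") + "\n" + line + "\n"

# Illustrative input; a real Modelfile from `ollama show --modelfile` is longer.
print(set_num_ctx("FROM qwen2\nPARAMETER temperature 0.7\n"))
```

Writing the returned text back to Modelfile and running the create step above would then produce the modified model.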
Setting num_ctx via the Ollama API.

You can use the llm_model_kwargs param to configure Ollama:

python
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=ollama_model_complete,  # Use Ollama model for text generation
    llm_model_name='your_model_name', # Your model name
    llm_model_kwargs={"options": {"num_ctx": 32768}},
    # Use Ollama embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=768,
        max_token_size=8192,
        func=lambda texts: ollama_embedding(
            texts,
            embed_model="nomic-embed-text"
        )
    ),
)

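Fully Functional Example

A fully functional example is available in examples/lightrag_ollama_demo.py. It uses the gemma2:2b model, runs only 4 requests in parallel, and sets the context size to 32k.

Low RAM GPUs

To run this experiment on a low-RAM GPU, select a small model and tune the context window (a larger context increases memory consumption). For example, running this Ollama example on a repurposed mining GPU with 6 GB of RAM required setting the context size to 26k when using gemma2:2b. It was able to find 197 entities and 19 relations in book.txt.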

Query Param

python
from dataclasses import dataclass
from typing import Literal

@dataclass
class QueryParam:
    mode: Literal["local", "global", "hybrid", "naive"] = "global"
    only_need_context: bool = False
    response_type: str = "Multiple Paragraphs"
    # Number of top-k items to retrieve; corresponds to entities in "local" mode and relationships in "global" mode.
    top_k: int = 60
    # Number of tokens for the original chunks.
    max_token_for_text_unit: int = 4000
    # Number of tokens for the relationship descriptions
    max_token_for_global_context: int = 4000
    # Number of tokens for the entity descriptions
    max_token_for_local_context: int = 4000
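
One way to work with these parameters is to keep a base configuration and derive one variant per retrieval mode. The sketch below uses `QueryParamSketch`, a hypothetical stand-in that copies a few of the fields listed above so that it runs without LightRAG installed; with the library, the same pattern should apply to `QueryParam` itself, assuming it is a dataclass.

```python
from dataclasses import dataclass, replace
from typing import Literal

@dataclass
class QueryParamSketch:
    """Stand-in with a subset of the fields listed above (not the library class)."""
    mode: Literal["local", "global", "hybrid", "naive"] = "global"
    only_need_context: bool = False
    response_type: str = "Multiple Paragraphs"
    top_k: int = 60

# Base settings shared by every query; only top_k differs from the defaults.
base = QueryParamSketch(top_k=20)

# One parameter object per retrieval mode, all sharing the base settings.
per_mode = {m: replace(base, mode=m) for m in ("naive", "local", "global", "hybrid")}

# A variant with only_need_context switched on.
context_only = replace(base, only_need_context=True)

print(per_mode["local"])
```

`dataclasses.replace` returns a copy, so `base` itself is left unchanged.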

Batch Insert

python
# Batch Insert: Insert multiple texts at once
rag.insert(["TEXT1", "TEXT2",...])
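
The list passed to a batch insert often comes from a folder of text files. The following is a minimal sketch of collecting such a list; `load_texts` and the `./docs` directory are hypothetical and not part of LightRAG.

```python
from pathlib import Path

def load_texts(directory: str, pattern: str = "*.txt") -> list[str]:
    """Read every matching file in `directory` (sorted by name) into a list of strings."""
    texts = []
    for path in sorted(Path(directory).glob(pattern)):
        content = path.read_text(encoding="utf-8").strip()
        if content:  # skip empty files
            texts.append(content)
    return texts

# Usage (assumes `rag` is an initialized LightRAG instance and ./docs exists):
# rag.insert(load_texts("./docs"))
```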

Incremental Insert

python
# Incremental Insert: Insert new documents into an existing LightRAG instance
rag = LightRAG(
     working_dir=WORKING_DIR,
     llm_model_func=llm_model_func,
     embedding_func=EmbeddingFunc(
          embedding_dim=embedding_dimension,
          max_token_size=8192,
          func=embedding_func,
     ),
)

with open("./newText.txt") as f:
    rag.insert(f.read())

Multi-file Type Support

textract supports reading file types such as TXT, DOCX, PPTX, CSV, and PDF.

python
import textract

file_path = 'TEXT.pdf'
text_content = textract.process(file_path)

rag.insert(text_content.decode('utf-8'))

Graph Visualization

Graph Visualization with HTML

  • The following code can be found in examples/graph_visual_with_html.py
python
import networkx as nx
from pyvis.network import Network

# Load the GraphML file
G = nx.read_graphml('./dickens/graph_chunk_entity_relation.graphml')

# Create a Pyvis network
net = Network(notebook=True)

# Convert NetworkX graph to Pyvis network
net.from_nx(G)

# Save and display the network
net.show('knowledge_graph.html')
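
Before rendering, it can help to confirm that the GraphML file actually contains nodes and edges. The sketch below counts them with the standard library only. It assumes the file uses the standard GraphML XML namespace; `graphml_counts` is a hypothetical helper, not part of LightRAG.

```python
import xml.etree.ElementTree as ET

# Standard GraphML namespace (assumed to be the one used in the file).
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

def graphml_counts(source) -> tuple[int, int]:
    """Return (node_count, edge_count) for a GraphML file path or binary file object."""
    root = ET.parse(source).getroot()
    nodes = root.findall(".//g:node", GRAPHML_NS)
    edges = root.findall(".//g:edge", GRAPHML_NS)
    return len(nodes), len(edges)

# Usage (assumes the working directory from the quick start exists):
# print(graphml_counts("./dickens/graph_chunk_entity_relation.graphml"))
```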

Graph Visualization with Neo4j

  • The following code can be found in examples/graph_visual_with_neo4j.py
python
import os
import json
from lightrag.utils import xml_to_json
from neo4j import GraphDatabase

# Constants
WORKING_DIR = "./dickens"
BATCH_SIZE_NODES = 500
BATCH_SIZE_EDGES = 100

# Neo4j connection credentials
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "your_password"

def convert_xml_to_json(xml_path, output_path):
    """Converts XML file to JSON and saves the output."""
    if not os.path.exists(xml_path):
        print(f"Error: File not found - {xml_path}")
        return None

    json_data = xml_to_json(xml_path)
    if json_data:
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(json_data, f, ensure_ascii=False, indent=2)
        print(f"JSON file created: {output_path}")
        return json_data
    else:
        print("Failed to create JSON data")
        return None

def process_in_batches(tx, query, data, batch_size):
    """Process data in batches and execute the given query."""
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        tx.run(query, {"nodes": batch} if "nodes" in query else {"edges": batch})

def main():
    # Paths
    xml_file = os.path.join(WORKING_DIR, 'graph_chunk_entity_relation.graphml')
    json_file = os.path.join(WORKING_DIR, 'graph_data.json')

    # Convert XML to JSON
    json_data = convert_xml_to_json(xml_file, json_file)
    if json_data is None:
        return

    # Load nodes and edges
    nodes = json_data.get('nodes', [])
    edges = json_data.get('edges', [])

    # Neo4j queries
    create_nodes_query = """
    UNWIND $nodes AS node
    MERGE (e:Entity {id: node.id})
    SET e.entity_type = node.entity_type,
        e.description = node.description,
        e.source_id = node.source_id,
        e.displayName = node.id
    REMOVE e:Entity
    WITH e, node
    CALL apoc.create.addLabels(e, [node.entity_type]) YIELD node AS labeledNode
    RETURN count(*)
    """

    create_edges_query = """
    UNWIND $edges AS edge
    MATCH (source {id: edge.source})
    MATCH (target {id: edge.target})
    WITH source, target, edge,
         CASE
            WHEN edge.keywords CONTAINS 'lead' THEN 'lead'
            WHEN edge.keywords CONTAINS 'participate' THEN 'participate'
            WHEN edge.keywords CONTAINS 'uses' THEN 'uses'
            WHEN edge.keywords CONTAINS 'located' THEN 'located'
            WHEN edge.keywords CONTAINS 'occurs' THEN 'occurs'
           ELSE REPLACE(SPLIT(edge.keywords, ',')[0], '\"', '')
         END AS relType
    CALL apoc.create.relationship(source, relType, {
      weight: edge.weight,
      description: edge.description,
      keywords: edge.keywords,
      source_id: edge.source_id
    }, target) YIELD rel
    RETURN count(*)
    """

    set_displayname_and_labels_query = """
    MATCH (n)
    SET n.displayName = n.id
    WITH n
    CALL apoc.create.setLabels(n, [n.entity_type]) YIELD node
    RETURN count(*)
    """

    # Create a Neo4j driver
    driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

    try:
        # Execute queries in batches
        with driver.session() as session:
            # Insert nodes in batches
            session.execute_write(process_in_batches, create_nodes_query, nodes, BATCH_SIZE_NODES)

            # Insert edges in batches
            session.execute_write(process_in_batches, create_edges_query, edges, BATCH_SIZE_EDGES)

            # Set displayName and labels
            session.run(set_displayname_and_labels_query)

    except Exception as e:
        print(f"Error occurred: {e}")

    finally:
        driver.close()

if __name__ == "__main__":
    main()
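
To preview which relationship type each edge would receive, the CASE expression in create_edges_query can be mirrored in plain Python. This is a sketch for inspection only; `relationship_type` is a hypothetical helper, and the sample keyword strings are made up.

```python
# Keywords checked in the same order as the CASE expression in create_edges_query.
KNOWN_TYPES = ("lead", "participate", "uses", "located", "occurs")

def relationship_type(keywords: str) -> str:
    """Mirror the Cypher CASE: first matching substring, else the first keyword."""
    for name in KNOWN_TYPES:
        if name in keywords:  # substring test, like Cypher CONTAINS
            return name
    # Fallback: first comma-separated keyword with double quotes removed.
    return keywords.split(",")[0].replace('"', "")

for sample in ('"leadership","mentoring"', '"friendship","loyalty"', '"causes"'):
    print(sample, "->", relationship_type(sample))
```

Because the match is a substring test, a keyword such as "causes" is classified as 'uses'. That is worth keeping in mind when reading the resulting graph.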

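API Server Implementation

LightRAG also provides a FastAPI-based server implementation for accessing RAG operations through a RESTful API. This lets you run LightRAG as a service and interact with it via HTTP requests.

Setting up the API Server

Setup instructions

1. First, make sure you have the required dependencies:

bash
pip install fastapi uvicorn pydantic

2. Set up your environment variables:

bash
export RAG_DIR="your_index_directory"  # Optional: Defaults to "index_default"

3. Run the API server:

bash
python examples/lightrag_api_openai_compatible_demo.py

The server will start at http://0.0.0.0:8020.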

API Endpoints

The API server provides the following endpoints:

  1. Query endpoint
  • URL: /query
  • Method: POST
  • Body (mode can be "naive", "local", "global", or "hybrid"):
json
{
    "query": "Your question here",
    "mode": "hybrid"
}
  • Example:
bash
curl -X POST "http://127.0.0.1:8020/query" \
     -H "Content-Type: application/json" \
     -d '{"query": "What are the main themes?", "mode": "hybrid"}'
  2. Insert text endpoint
  • URL: /insert
  • Method: POST
  • Body:
json
{
    "text": "Your text content here"
}
  • Example:
bash
curl -X POST "http://127.0.0.1:8020/insert" \
     -H "Content-Type: application/json" \
     -d '{"text": "Content to be inserted into RAG"}'
  3. Insert file endpoint
  • URL: /insert_file
  • Method: POST
  • Body:
json
{
    "file_path": "path/to/your/file.txt"
}
  • Example:
bash
curl -X POST "http://127.0.0.1:8020/insert_file" \
     -H "Content-Type: application/json" \
     -d '{"file_path": "./book.txt"}'
  4. Health check endpoint
  • URL: /health
  • Method: GET
  • Example:
bash
curl -X GET "http://127.0.0.1:8020/health"
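
The curl calls above translate directly into Python. The sketch below uses only the standard library; `build_post` and `send` are hypothetical helper names, and it assumes the server replies with a JSON body, which this document does not specify.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8020"  # address used in the curl examples above

def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for one of the endpoints above."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request: urllib.request.Request, timeout: float = 60.0):
    """Send the request and decode the response body as JSON (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Usage (requires the API server to be running):
# print(send(build_post("/query", {"query": "What are the main themes?", "mode": "hybrid"})))
# print(send(build_post("/insert", {"text": "Content to be inserted into RAG"})))
```

Separating request construction from sending keeps the payload logic testable without a running server.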

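Configuration

The API server can be configured using environment variables:

  • RAG_DIR: Directory for storing the RAG index (default: "index_default")
  • API keys and base URLs for your specific LLM and embedding model providers should be configured in the code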
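
Reading the RAG_DIR setting with its documented default takes one line. The sketch below shows one way to do it; `resolve_rag_dir` is a hypothetical name, and the server example may read the variable differently.

```python
import os

def resolve_rag_dir(environ=os.environ) -> str:
    """Return RAG_DIR from the environment, or the documented default."""
    return environ.get("RAG_DIR", "index_default")

print(resolve_rag_dir())
```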

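Error Handling

The API includes comprehensive error handling:

  • File not found errors (404)
  • Processing errors (500)
  • Support for multiple file encodings (UTF-8 and GBK)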
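
Support for several encodings is commonly implemented by trying each one in turn. The following is a sketch of that idea, not the server's actual code; `decode_with_fallback` is a hypothetical helper.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "gbk")) -> str:
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode bytes with any of: {', '.join(encodings)}")

print(decode_with_fallback("中文".encode("gbk")))
```

The order matters: some byte sequences are valid in more than one encoding, so the stricter one (UTF-8 here) is tried first.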

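Evaluation

Dataset

The dataset used in LightRAG can be downloaded from TommyChien/UltraDomain.

Generate Query

LightRAG uses a prompt to generate high-level queries; the corresponding code is in examples/generate_query.py.

Batch Eval

To evaluate the performance of two RAG systems on high-level queries, LightRAG uses an evaluation prompt; the corresponding code is in examples/batch_eval.py.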

Overall Performance Table

                   Agriculture         CS                  Legal               Mix
                   NaiveRAG  LightRAG  NaiveRAG  LightRAG  NaiveRAG  LightRAG  NaiveRAG  LightRAG
Comprehensiveness  32.69%    67.31%    35.44%    64.56%    19.05%    80.95%    36.36%    63.64%
Diversity          24.09%    75.91%    35.24%    64.76%    10.98%    89.02%    30.76%    69.24%
Empowerment        31.35%    68.65%    35.48%    64.52%    17.59%    82.41%    40.95%    59.05%
Overall            33.30%    66.70%    34.76%    65.24%    17.46%    82.54%    37.59%    62.40%
                   RQ-RAG    LightRAG  RQ-RAG    LightRAG  RQ-RAG    LightRAG  RQ-RAG    LightRAG
Comprehensiveness  32.05%    67.95%    39.30%    60.70%    18.57%    81.43%    38.89%    61.11%
Diversity          29.44%    70.56%    38.71%    61.29%    15.14%    84.86%    28.50%    71.50%
Empowerment        32.51%    67.49%    37.52%    62.48%    17.80%    82.20%    43.96%    56.04%
Overall            33.29%    66.71%    39.03%    60.97%    17.80%    82.20%    39.61%    60.39%
                   HyDE      LightRAG  HyDE      LightRAG  HyDE      LightRAG  HyDE      LightRAG
Comprehensiveness  24.39%    75.61%    36.49%    63.51%    27.68%    72.32%    42.17%    57.83%
Diversity          24.96%    75.34%    37.41%    62.59%    18.79%    81.21%    30.88%    69.12%
Empowerment        24.89%    75.11%    34.99%    65.01%    26.99%    73.01%    45.61%    54.39%
Overall            23.17%    76.83%    35.67%    64.33%    27.68%    72.32%    42.72%    57.28%
                   GraphRAG  LightRAG  GraphRAG  LightRAG  GraphRAG  LightRAG  GraphRAG  LightRAG
Comprehensiveness  45.56%    54.44%    45.98%    54.02%    47.13%    52.87%    51.86%    48.14%
Diversity          19.65%    80.35%    39.64%    60.36%    25.55%    74.45%    35.87%    64.13%
Empowerment        36.69%    63.31%    45.09%    54.91%    42.81%    57.19%    52.94%    47.06%
Overall            43.62%    56.38%    45.98%    54.02%    45.70%    54.30%    51.86%    48.14%
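  • The table compares LightRAG against NaiveRAG, RQ-RAG, HyDE, and GraphRAG on four datasets (Agriculture, CS, Legal, and Mix). LightRAG scores higher than NaiveRAG, RQ-RAG, and HyDE on Comprehensiveness, Diversity, Empowerment, and Overall across all four datasets; its narrowest margin against HyDE is Empowerment on Mix (54.39% vs. 45.61%). Against GraphRAG, LightRAG scores lower only on the Mix dataset, in Comprehensiveness, Empowerment, and Overall, and scores higher in every other comparison.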
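
The comparison can be checked mechanically. The sketch below copies the percentages from the table into a dictionary and lists every cell where the baseline's percentage exceeds LightRAG's. The names `RESULTS` and `baseline_wins` are hypothetical; the numbers are taken from the table as printed.

```python
DATASETS = ("Agriculture", "CS", "Legal", "Mix")
METRICS = ("Comprehensiveness", "Diversity", "Empowerment", "Overall")

# Each row holds (baseline %, LightRAG %) pairs for Agriculture, CS, Legal, Mix.
# Rows are ordered as in METRICS.
RESULTS = {
    "NaiveRAG": [
        [32.69, 67.31, 35.44, 64.56, 19.05, 80.95, 36.36, 63.64],
        [24.09, 75.91, 35.24, 64.76, 10.98, 89.02, 30.76, 69.24],
        [31.35, 68.65, 35.48, 64.52, 17.59, 82.41, 40.95, 59.05],
        [33.30, 66.70, 34.76, 65.24, 17.46, 82.54, 37.59, 62.40],
    ],
    "RQ-RAG": [
        [32.05, 67.95, 39.30, 60.70, 18.57, 81.43, 38.89, 61.11],
        [29.44, 70.56, 38.71, 61.29, 15.14, 84.86, 28.50, 71.50],
        [32.51, 67.49, 37.52, 62.48, 17.80, 82.20, 43.96, 56.04],
        [33.29, 66.71, 39.03, 60.97, 17.80, 82.20, 39.61, 60.39],
    ],
    "HyDE": [
        [24.39, 75.61, 36.49, 63.51, 27.68, 72.32, 42.17, 57.83],
        [24.96, 75.34, 37.41, 62.59, 18.79, 81.21, 30.88, 69.12],
        [24.89, 75.11, 34.99, 65.01, 26.99, 73.01, 45.61, 54.39],
        [23.17, 76.83, 35.67, 64.33, 27.68, 72.32, 42.72, 57.28],
    ],
    "GraphRAG": [
        [45.56, 54.44, 45.98, 54.02, 47.13, 52.87, 51.86, 48.14],
        [19.65, 80.35, 39.64, 60.36, 25.55, 74.45, 35.87, 64.13],
        [36.69, 63.31, 45.09, 54.91, 42.81, 57.19, 52.94, 47.06],
        [43.62, 56.38, 45.98, 54.02, 45.70, 54.30, 51.86, 48.14],
    ],
}

def baseline_wins(results=RESULTS):
    """List (baseline, metric, dataset) cells where the baseline beats LightRAG."""
    wins = []
    for baseline, rows in results.items():
        for metric, row in zip(METRICS, rows):
            for i, dataset in enumerate(DATASETS):
                if row[2 * i] > row[2 * i + 1]:
                    wins.append((baseline, metric, dataset))
    return wins

for cell in baseline_wins():
    print(cell)
```

With the values above, the only cells listed are GraphRAG on the Mix dataset for Comprehensiveness, Empowerment, and Overall.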

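Reproduce

All the code can be found in the ./reproduce directory.

Step 0: Extract Unique Contexts

First, we need to extract the unique contexts from the datasets.

Code

python
import glob
import json
import os

def extract_unique_contexts(input_directory, output_directory):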

    os.makedirs(output_directory, exist_ok=True)

    jsonl_files = glob.glob(os.path.join(input_directory, '*.jsonl'))
    print(f"Found {len(jsonl_files)} JSONL files.")

    for file_path in jsonl_files:
        filename = os.path.basename(file_path)
        name, ext = os.path.splitext(filename)
        output_filename = f"{name}_unique_contexts.json"
        output_path = os.path.join(output_directory, output_filename)

        unique_contexts_dict = {}

        print(f"Processing file: {filename}")

        try:
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line_number, line in enumerate(infile, start=1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json_obj = json.loads(line)
                        context = json_obj.get('context')
                        if context and context not in unique_contexts_dict:
                            unique_contexts_dict[context] = None
                    except json.JSONDecodeError as e:
                        print(f"JSON decoding error in file {filename} at line {line_number}: {e}")
        except FileNotFoundError:
            print(f"File not found: {filename}")
            continue
        except Exception as e:
            print(f"An error occurred while processing file {filename}: {e}")
            continue

        unique_contexts_list = list(unique_contexts_dict.keys())
        print(f"There are {len(unique_contexts_list)} unique `context` entries in the file {filename}.")

        try:
            with open(output_path, 'w', encoding='utf-8') as outfile:
                json.dump(unique_contexts_list, outfile, ensure_ascii=False, indent=4)
            print(f"Unique `context` entries have been saved to: {output_filename}")
        except Exception as e:
            print(f"An error occurred while saving to the file {output_filename}: {e}")

    print("All files have been processed.")
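
The core of this step is order-preserving deduplication with a dictionary. The following sketch isolates that idea on in-memory lines so it can be tried without any dataset files; `unique_contexts` and the sample records are hypothetical.

```python
import json

def unique_contexts(jsonl_lines):
    """Return distinct `context` values from JSONL lines, in first-seen order."""
    seen = {}  # dict keys keep insertion order, so this acts as an ordered set
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        context = json.loads(line).get("context")
        if context:
            seen.setdefault(context, None)
    return list(seen)

sample = [
    '{"context": "Marley was dead.", "input": "q1"}',
    "",
    '{"context": "Scrooge kept the shop.", "input": "q2"}',
    '{"context": "Marley was dead.", "input": "q3"}',
    '{"input": "no context field"}',
]
print(unique_contexts(sample))  # ['Marley was dead.', 'Scrooge kept the shop.']
```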

Step 1: Insert Contexts

We insert the extracted contexts into the LightRAG system.

Code

python
import json
import time

def insert_text(rag, file_path):
    with open(file_path, mode='r') as f:
        unique_contexts = json.load(f)

    retries = 0
    max_retries = 3
    while retries < max_retries:
        try:
            rag.insert(unique_contexts)
            break
        except Exception as e:
            retries += 1
            print(f"Insertion failed, retrying ({retries}/{max_retries}), error: {e}")
            time.sleep(10)
    if retries == max_retries:
        print("Insertion failed after exceeding the maximum number of retries")

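Step 2: Generate Queries

We extract tokens from the first half and the second half of each context in the dataset, then combine them as dataset descriptions to generate queries.

Code

python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def get_summary(context, tot_tokens=2000):
    tokens = tokenizer.tokenize(context)
    half_tokens = tot_tokens // 2

    # Window near the start: skip the first 1000 tokens, then take half_tokens.
    start_tokens = tokens[1000:1000 + half_tokens]
    # Window near the end: take half_tokens, stopping 1000 tokens before the end.
    # (The stop index must be negative; a stop of 1000 yields an empty slice on long contexts.)
    end_tokens = tokens[-(1000 + half_tokens):-1000]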

    summary_tokens = start_tokens + end_tokens
    summary = tokenizer.convert_tokens_to_string(summary_tokens)

    return summary
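
The description above implies one window near the start of the context and one near the end. The sketch below shows that slicing on integer stand-ins for tokens, so no tokenizer is needed. It assumes the end window is meant to stop `offset` tokens before the end of the context (a negative stop index); `summary_windows` is a hypothetical name.

```python
def summary_windows(tokens, tot_tokens=2000, offset=1000):
    """Return (head, tail): two windows of tot_tokens // 2 items each."""
    half = tot_tokens // 2
    head = tokens[offset:offset + half]        # starts `offset` items in
    tail = tokens[-(offset + half):-offset]    # ends `offset` items before the end
    return head, tail

tokens = list(range(10_000))  # integers stand in for tokens
head, tail = summary_windows(tokens)
print(head[0], head[-1], tail[0], tail[-1])  # 1000 1999 8000 8999
```

With the defaults, the two windows are disjoint only for contexts of at least 4000 tokens; for shorter contexts they overlap or shrink.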

Step 3: Query

For the queries generated in Step 2, we extract them and query LightRAG.

Code

python
import re

def extract_queries(file_path):
    with open(file_path, 'r') as f:
        data = f.read()

    data = data.replace('**', '')

    queries = re.findall(r'- Question \d+: (.+)', data)

    return queries
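
To see what the regular expression picks up, the same extraction can be run on an in-memory string. This is a sketch: `extract_queries_from_text` is a hypothetical variant that takes text instead of a file path, and the sample layout is made up to exercise both the bold and plain forms of a question line.

```python
import re

def extract_queries_from_text(data: str) -> list[str]:
    """Strip bold markers, then capture the text after each '- Question N:' prefix."""
    data = data.replace("**", "")
    return re.findall(r"- Question \d+: (.+)", data)

sample = """\
- User 1: A student
    - Task 1: Summarize the plot
        - **Question 1:** Who is Scrooge?
        - Question 2: What changes him?
"""
print(extract_queries_from_text(sample))  # ['Who is Scrooge?', 'What changes him?']
```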

Code Structure

text
.
├── examples
│   ├── batch_eval.py
│   ├── generate_query.py
│   ├── graph_visual_with_html.py
│   ├── graph_visual_with_neo4j.py
│   ├── lightrag_api_openai_compatible_demo.py
│   ├── lightrag_azure_openai_demo.py
│   ├── lightrag_bedrock_demo.py
│   ├── lightrag_hf_demo.py
│   ├── lightrag_lmdeploy_demo.py
│   ├── lightrag_ollama_demo.py
│   ├── lightrag_openai_compatible_demo.py
│   ├── lightrag_openai_demo.py
│   ├── lightrag_siliconcloud_demo.py
│   └── vram_management_demo.py
├── lightrag
│   ├── __init__.py
│   ├── base.py
│   ├── lightrag.py
│   ├── llm.py
│   ├── operate.py
│   ├── prompt.py
│   ├── storage.py
│   └── utils.py
├── reproduce
│   ├── Step_0.py
│   ├── Step_1_openai_compatible.py
│   ├── Step_1.py
│   ├── Step_2.py
│   ├── Step_3_openai_compatible.py
│   └── Step_3.py
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── requirements.txt
└── setup.py

Star History

Citation

bibtex
@article{guo2024lightrag,
title={LightRAG: Simple and Fast Retrieval-Augmented Generation},
author={Zirui Guo and Lianghao Xia and Yanhua Yu and Tu Ao and Chao Huang},
year={2024},
eprint={2410.05779},
archivePrefix={arXiv},
primaryClass={cs.IR}
}

GitHub repository: https://github.com/HKUDS/LightRAG/
