GraphRAG Deployment Workflow and Neo4j Visualization

Main references:
Tutorial on Microsoft's open-source GraphRAG (comprehensive and detailed)
Tutorial on Microsoft's open-source GraphRAG: testing GraphRAG with custom data
GraphRAG + GPT-4o mini: building an AI graph knowledge base at low cost
GraphRAG + Ollama local deployment (comprehensive, step-by-step tutorial)

(OpenAI models) platform.openai.com/docs/models

(Neo4j visualization) neo4j.com/developer-b...

(Deploying Qwen) segmentfault.com/a/119000004...

(Chinese documentation) graphragcn.com/get_started

(Official documentation) microsoft.github.io/graphrag/po...

Preparing the source code

Official GitHub: github.com/microsoft/g...

Download the project

bash
git clone https://github.com/microsoft/graphrag.git

Enter the directory

bash
cd graphrag

Environment setup

With Anaconda already installed, create a virtual environment using Python 3.11:

bash
conda create -n GraphRAG python=3.11
conda activate GraphRAG

Install dependencies and initialize

GraphRAG is managed with Poetry, so install the Poetry package manager and then the project dependencies.

Install Poetry

bash
pip install poetry 
poetry install

Initialize

bash
poetry run poe index --init --root .   

After this runs correctly, the output and prompts directories and the .env and settings.yaml files are generated in the graphrag directory.
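
After initialization, the project root looks roughly like this (a sketch based on the files listed above; the input directory is created manually in the next step):

text
graphrag/
├── .env            # API key and other environment variables
├── settings.yaml   # pipeline configuration
├── prompts/        # generated prompt templates
├── output/         # indexing artifacts are written here
└── input/          # create this yourself and place your documents here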

Place the documents to be indexed in the ./input/ directory.

Note that GraphRAG currently supports only txt and csv files; other formats must be converted first.
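
For example, a minimal sketch for converting a PDF into a plain-text file that GraphRAG can index (this assumes the third-party pypdf package is installed and uses placeholder file names):

python
# Minimal sketch: extract plain text from a PDF for GraphRAG indexing.
# Requires: pip install pypdf
from pypdf import PdfReader

reader = PdfReader("source.pdf")  # placeholder input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

with open("input/source.txt", "w", encoding="utf-8") as f:
    f.write(text)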

The document used in the official example is Charles Dickens's famous novel A Christmas Carol:

bash
mkdir -p ./input
curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./input/book.txt

Modify the configuration files

Two files need to be modified: .env and settings.yaml.

The .env file holds a few settings, typically important parameters such as the API key and the model selection.

The official default is GPT-4 Turbo preview; simply set GRAPHRAG_API_KEY to the API key provided by OpenAI.

Once set, this is a global configuration and does not need to be modified again.
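
A minimal .env sketch (the value is a placeholder; substitute your own key):

bash
GRAPHRAG_API_KEY=<your_api_key>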

The settings.yaml file configures the GraphRAG pipeline; modify the LLM model and the corresponding api_base.

GraphRAG makes many calls to the LLM and the embedding model. The default, OpenAI's GPT-4, is expensive, so you can switch to another model or to a domestic LLM API.

Here an API key provided by agicto is used (new users get a free quota of 10 RMB on registration).

The main changes are the API base URL and the model name in the yaml file (DeepSeek is used here). The complete settings file after modification is shown below.

To call DeepSeek through agicto, the following settings.yaml can be used directly:

yaml
encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat  # modified
  model_supports_json: false # recommended if this is available for your model.
  api_base: https://api.agicto.cn/v1 # modified
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small # modified
    api_base: https://api.agicto.cn/v1 # modified
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional
  

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents
    
input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 0

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 0

community_report:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: false
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # max_tokens: 12000

global_search:
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Build the GraphRAG index

Building the index takes a while, depending on the length of the documents.

bash
poetry run poe index --root .   

or

bash
python -m graphrag.index --root .

Run queries

Global query: focuses more on understanding the document as a whole.

bash
poetry run poe query --root . --method global "What is this document mainly about?"

After it runs successfully you can see the output.

Local query: focuses more on details.

bash
poetry run poe query --root . --method local "What is this document mainly about?"

After it runs successfully you can see the output.

Neo4j visualization

Note: for installing Neo4j, refer to juejin.cn/post/715718...

Convert .parquet to .csv

After the pipeline above runs successfully, .parquet files are generated under the output folder.

To import them into Neo4j, they must first be converted to CSV files; a conversion script is given below.

python
import os
import pandas as pd
import csv

# Directories containing the Parquet files and receiving the CSV files
parquet_dir = 'graphrag/output/20240729-173545/artifacts'
csv_dir = 'graphrag/output/20240729-173545/csv'

# Create the CSV output directory if it does not exist
os.makedirs(csv_dir, exist_ok=True)


# Clean string fields; CSV quoting itself is handled by to_csv below
def clean_quotes(value):
    if isinstance(value, str):
        # Strip leading/trailing spaces and drop stray quote characters;
        # to_csv(quoting=csv.QUOTE_NONNUMERIC) re-quotes fields as needed
        value = value.strip().replace('""', '"').replace('"', '')
    return value


# Convert all Parquet files to CSV
for file_name in os.listdir(parquet_dir):
    if file_name.endswith('.parquet'):
        parquet_file = os.path.join(parquet_dir, file_name)
        csv_file = os.path.join(csv_dir, file_name.replace('.parquet', '.csv'))

        # Load the Parquet file
        df = pd.read_parquet(parquet_file)

        # Clean quotes in string fields
        for column in df.select_dtypes(include=['object']).columns:
            df[column] = df[column].apply(clean_quotes)

        # Save to CSV
        df.to_csv(csv_file, index=False, quoting=csv.QUOTE_NONNUMERIC)
        print(f"Converted {parquet_file} to {csv_file} successfully.")

print("All Parquet files have been converted to CSV.")

After the script runs successfully, the corresponding CSV files are generated.

Import the CSV files into Neo4j

Copy all of the generated CSV files into the import folder of the Neo4j installation directory.
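
A minimal sketch, assuming the CSV directory produced by the conversion script above and a placeholder Neo4j installation path (adjust both paths to your setup):

bash
cp graphrag/output/20240729-173545/csv/*.csv /path/to/neo4j/import/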

Start Neo4j and run the following import statements:

cypher
// 1. Import Documents
LOAD CSV WITH HEADERS FROM 'file:///create_final_documents.csv' AS row
CREATE (d:Document {
    id: row.id,
    title: row.title,
    raw_content: row.raw_content,
    text_unit_ids: row.text_unit_ids
});

// 2. Import Text Units
LOAD CSV WITH HEADERS FROM 'file:///create_final_text_units.csv' AS row
CREATE (t:TextUnit {
    id: row.id,
    text: row.text,
    n_tokens: toFloat(row.n_tokens),
    document_ids: row.document_ids,
    entity_ids: row.entity_ids,
    relationship_ids: row.relationship_ids
});

// 3. Import Entities
LOAD CSV WITH HEADERS FROM 'file:///create_final_entities.csv' AS row
CREATE (e:Entity {
    id: row.id,
    name: row.name,
    type: row.type,
    description: row.description,
    human_readable_id: toInteger(row.human_readable_id),
    text_unit_ids: row.text_unit_ids
});

// 4. Import Relationships
LOAD CSV WITH HEADERS FROM 'file:///create_final_relationships.csv' AS row
CREATE (r:Relationship {
    source: row.source,
    target: row.target,
    weight: toFloat(row.weight),
    description: row.description,
    id: row.id,
    human_readable_id: row.human_readable_id,
    source_degree: toInteger(row.source_degree),
    target_degree: toInteger(row.target_degree),
    rank: toInteger(row.rank),
    text_unit_ids: row.text_unit_ids
});

// 5. Import Nodes
LOAD CSV WITH HEADERS FROM 'file:///create_final_nodes.csv' AS row
CREATE (n:Node {
    id: row.id,
    level: toInteger(row.level),
    title: row.title,
    type: row.type,
    description: row.description,
    source_id: row.source_id,
    community: row.community,
    degree: toInteger(row.degree),
    human_readable_id: toInteger(row.human_readable_id),
    size: toInteger(row.size),
    entity_type: row.entity_type,
    top_level_node_id: row.top_level_node_id,
    x: toInteger(row.x),
    y: toInteger(row.y)
});

// 6. Import Communities
LOAD CSV WITH HEADERS FROM 'file:///create_final_communities.csv' AS row
CREATE (c:Community {
    id: row.id,
    title: row.title,
    level: toInteger(row.level),
    raw_community: row.raw_community,
    relationship_ids: row.relationship_ids,
    text_unit_ids: row.text_unit_ids
});

// 7. Import Community Reports
LOAD CSV WITH HEADERS FROM 'file:///create_final_community_reports.csv' AS row
CREATE (cr:CommunityReport {
    id: row.id,
    community: row.community,
    full_content: row.full_content,
    level: toInteger(row.level),
    rank: toFloat(row.rank),
    title: row.title,
    rank_explanation: row.rank_explanation,
    summary: row.summary,
    findings: row.findings,
    full_content_json: row.full_content_json
});

// 8. Create indexes for better performance
CREATE INDEX FOR (d:Document) ON (d.id);
CREATE INDEX FOR (t:TextUnit) ON (t.id);
CREATE INDEX FOR (e:Entity) ON (e.id);
CREATE INDEX FOR (r:Relationship) ON (r.id);
CREATE INDEX FOR (n:Node) ON (n.id);
CREATE INDEX FOR (c:Community) ON (c.id);
CREATE INDEX FOR (cr:CommunityReport) ON (cr.id);

// 9. Create relationships after all nodes are imported
MATCH (d:Document)
UNWIND split(d.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (d)-[:HAS_TEXT_UNIT]->(t);

MATCH (t:TextUnit)
UNWIND split(t.document_ids, ',') AS docId
MATCH (d:Document {id: trim(docId)})
CREATE (t)-[:BELONGS_TO]->(d);

MATCH (t:TextUnit)
UNWIND split(t.entity_ids, ',') AS entityId
MATCH (e:Entity {id: trim(entityId)})
CREATE (t)-[:HAS_ENTITY]->(e);

MATCH (t:TextUnit)
UNWIND split(t.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (t)-[:HAS_RELATIONSHIP]->(r);

MATCH (e:Entity)
UNWIND split(e.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (e)-[:MENTIONED_IN]->(t);

MATCH (r:Relationship)
MATCH (source:Entity {name: r.source})
MATCH (target:Entity {name: r.target})
CREATE (source)-[:RELATES_TO]->(target);

MATCH (r:Relationship)
UNWIND split(r.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (r)-[:MENTIONED_IN]->(t);

MATCH (c:Community)
UNWIND split(c.relationship_ids, ',') AS relId
MATCH (r:Relationship {id: trim(relId)})
CREATE (c)-[:HAS_RELATIONSHIP]->(r);

MATCH (c:Community)
UNWIND split(c.text_unit_ids, ',') AS textUnitId
MATCH (t:TextUnit {id: trim(textUnitId)})
CREATE (c)-[:HAS_TEXT_UNIT]->(t);

MATCH (cr:CommunityReport)
MATCH (c:Community {id: cr.community})
CREATE (cr)-[:REPORTS_ON]->(c);

After the statements run successfully, Neo4j confirms the created nodes and relationships.
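
As an optional sanity check (not part of the original import statements), the number of imported nodes per label can be inspected with:

cypher
// Count imported nodes per label
MATCH (n)
RETURN labels(n) AS label, count(*) AS count
ORDER BY count DESC;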

Knowledge graph visualization

You can query the relationships associated with the different labels; the relevant Cypher queries are listed below.

cypher
// 1. Visualize Document to TextUnit relationships
MATCH (d:Document)-[r:HAS_TEXT_UNIT]->(t:TextUnit)
RETURN d, r, t
LIMIT 50;

// 2. Visualize Entity to TextUnit relationships
MATCH (e:Entity)-[r:MENTIONED_IN]->(t:TextUnit)
RETURN e, r, t
LIMIT 50;

// 3. Visualize Relationships between Entities
MATCH (e1:Entity)-[r:RELATES_TO]->(e2:Entity)
RETURN e1, r, e2
LIMIT 50;

// 4. Visualize Community to Relationship connections
MATCH (c:Community)-[r:HAS_RELATIONSHIP]->(rel:Relationship)
RETURN c, r, rel
LIMIT 50;

// 5. Visualize Community Reports and their Communities
MATCH (cr:CommunityReport)-[r:REPORTS_ON]->(c:Community)
RETURN cr, r, c
LIMIT 50;

// 6. Visualize the most connected Entities (Updated)
// Note: the COUNT {} subquery requires Neo4j 5+; on Neo4j 4.x use size((e)-[:RELATES_TO]->(:Entity)) instead
MATCH (e:Entity)
WITH e, COUNT { (e)-[:RELATES_TO]->(:Entity) } AS degree
ORDER BY degree DESC
LIMIT 10
MATCH (e)-[r:RELATES_TO]->(other:Entity)
RETURN e, r, other;

// 7. Visualize TextUnits and their connections to Entities and Relationships
MATCH (t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
MATCH (t)-[:HAS_RELATIONSHIP]->(r:Relationship)
RETURN t, e, r
LIMIT 50;

// 8. Visualize Documents and their associated Entities (via TextUnits)
MATCH (d:Document)-[:HAS_TEXT_UNIT]->(t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
RETURN d, t, e
LIMIT 50;

// 9. Visualize Communities and their TextUnits
MATCH (c:Community)-[:HAS_TEXT_UNIT]->(t:TextUnit)
RETURN c, t
LIMIT 50;

// 10. Visualize Relationships and their associated TextUnits
MATCH (r:Relationship)-[:MENTIONED_IN]->(t:TextUnit)
RETURN r, t
LIMIT 50;

// 11. Visualize Entities of different types and their relationships
MATCH (e1:Entity)-[r:RELATES_TO]->(e2:Entity)
WHERE e1.type <> e2.type
RETURN e1, r, e2
LIMIT 50;

// 12. Visualize the distribution of Entity types
MATCH (e:Entity)
RETURN e.type AS EntityType, COUNT(e) AS Count
ORDER BY Count DESC;

// 13. Visualize the most frequently occurring relationships
MATCH ()-[r:RELATES_TO]->()
RETURN TYPE(r) AS RelationshipType, COUNT(r) AS Count
ORDER BY Count DESC
LIMIT 10;

// 14. Visualize the path from Document to Entity
MATCH path = (d:Document)-[:HAS_TEXT_UNIT]->(t:TextUnit)-[:HAS_ENTITY]->(e:Entity)
RETURN path
LIMIT 25;

Example: running query 3

cypher
// 3. Visualize Relationships between Entities
MATCH (e1:Entity)-[r:RELATES_TO]->(e2:Entity)
RETURN e1, r, e2
LIMIT 50;

Result: Neo4j Browser renders the matched entity pairs and their RELATES_TO relationships as a graph.
