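Articles in this series
Mastering Large Language Models: Deploying LLMs locally with langchain and Ollama
Mastering Large Language Models: Importing models downloaded from huggingface into ollama
Mastering Large Language Models: Calling ollama vision multimodal language models from langchain
Mastering Large Language Models: Building a knowledge graph with GraphRAG + Ollama
Mastering Large Language Models: Fully solving the problem of GraphRAG knowledge graphs coming out entirely in English
Mastering Large Language Models: Configuring the Neo4j graph database (with the apoc plugin) and importing a GraphRAG-generated knowledge graph
Mastering Large Language Models: A beginner's tutorial on locally deploying deepseek R1 with a chat interface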
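Preface
In previous articles the author deployed GraphRAG with a local model to generate a knowledge graph, and fixed the issue where the original prompts produced an English-only knowledge graph. This article configures the Neo4j graph database and imports the knowledge graph data generated by GraphRAG. For the earlier material, see: Mastering Large Language Models: Building a knowledge graph with GraphRAG + Ollama, and Mastering Large Language Models: Fully solving the problem of GraphRAG knowledge graphs coming out entirely in English.
Installing the JDK
Neo4j is written in Java, so a JDK has to be installed first. If you have never installed one, download it from the official site and install it.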
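Official site: https://www.oracle.com/java/technologies/downloads/?er=221886#java11-windows
Choose a suitable version to download. Neo4j 5.x needs Java 17 or later, so check the requirements of the Neo4j release you plan to install.
Then follow the installer prompts.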
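Installing Neo4j
Downloading Neo4j
Neo4j website: https://neo4j.com/deployment-center/
The download is a compressed archive; extract it into your target installation directory. Note the extracted path, because it is needed for the environment variables. The author's path is D:\neo4j-community-5.26.1, which can serve as a reference during configuration.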
Configuring environment variables
Open the environment variable editor and create a new system variable named NEO4J_HOME with the value D:\neo4j-community-5.26.1
Edit the Path variable: double-click it, click New, and add the following entry:
%NEO4J_HOME%\bin
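The two variables can be sanity-checked from Python before moving on. This is only a minimal sketch: `check_neo4j_env` is a hypothetical helper name, not something Neo4j ships, and it only inspects an environment mapping, so it does not prove that Neo4j itself will start.

```python
import os
from pathlib import PureWindowsPath


def check_neo4j_env(env):
    """Return a list of problems found in an environment mapping (hypothetical helper)."""
    problems = []
    home = env.get("NEO4J_HOME")
    if not home:
        problems.append("NEO4J_HOME is not set")
        return problems
    # The Path entry may be stored literally as %NEO4J_HOME%\bin or already expanded.
    expected = {
        str(PureWindowsPath(home) / "bin").lower(),
        r"%neo4j_home%\bin",
    }
    raw_path = env.get("Path", env.get("PATH", ""))
    entries = {p.strip().rstrip("\\/").lower() for p in raw_path.split(";")}
    if not expected & entries:
        problems.append("Path does not contain %NEO4J_HOME%\\bin")
    return problems


if __name__ == "__main__":
    # Open a new terminal first: variables changed in the dialog are not
    # visible to shells that were already running.
    print(check_neo4j_env(os.environ) or "Environment looks fine")
```

An empty list means both entries were found.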
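Installing the apoc plugin
Importing the knowledge graph uses some functions from the apoc plugin, so apoc has to be installed first.
apoc releases: https://github.com/neo4j/apoc/releases?page=1
Pick the release whose version matches your Neo4j version, download the jar, and place it under <neo4j path>/plugins.
Then find neo4j.conf under <neo4j path>/conf, append the following settings to the end of the file, and save it.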
dbms.security.procedures.unrestricted=apoc.*
dbms.security.procedures.allowlist=apoc.*
server.jvm.additional=-Dapoc.export.file.enabled=true
server.jvm.additional=-Dapoc.import.file.enabled=true
dbms.security.allow_csv_import_from_file_urls=true
Create a new file named apoc.conf under <neo4j path>/conf, write the following settings into it, and save it.
apoc.export.file.enabled=true
apoc.import.file.use_neo4j_config=false
apoc.import.file.enabled=true
# Adjust both paths to your own Neo4j install directory
apoc.import.file.directory=D:/neo4j-community-5.26.1/import
apoc.export.file.directory=D:/neo4j-community-5.26.1/export
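A mistyped key in these files is easy to miss, so it can help to check them before starting the server. The following is a rough sketch rather than an official tool: `parse_conf` and `missing_settings` are hypothetical helpers, and the parser only understands plain `key=value` lines and `#` comments.

```python
# Settings this article adds to neo4j.conf (key -> expected value).
REQUIRED = {
    "dbms.security.procedures.unrestricted": "apoc.*",
    "dbms.security.procedures.allowlist": "apoc.*",
    "dbms.security.allow_csv_import_from_file_urls": "true",
}


def parse_conf(text):
    """Parse key=value lines; a key may repeat (e.g. server.jvm.additional)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)  # split on the first '=' only
        settings.setdefault(key.strip(), []).append(value.strip())
    return settings


def missing_settings(text, required=REQUIRED):
    """Return the required keys that are absent or have a different value."""
    settings = parse_conf(text)
    return [key for key, value in required.items() if value not in settings.get(key, [])]


# Example usage (the path is the author's install directory, adjust as needed):
# from pathlib import Path
# conf = Path(r"D:\neo4j-community-5.26.1\conf\neo4j.conf").read_text(encoding="utf-8")
# print(missing_settings(conf))
```

An empty list means all three settings are present with the expected values.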
Importing the knowledge graph
Starting Neo4j
Run the following on the command line:
bash
neo4j console
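Then open http://localhost:7474 in a browser to set up the user. The initial username and password are both neo4j; you will be asked to reset the password after the first login.
To keep the database running in the background, use the following command instead: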
bash
neo4j start
If neo4j start reports an error, run the following command to install the Windows service.
bash
neo4j windows-service install
Once the service is installed, run neo4j start again. Note that a service started with neo4j start has to be stopped with neo4j stop.
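Importing the knowledge graph with Python
Install the required packages with pip. pandas.read_parquet also needs a Parquet engine such as pyarrow, which is normally already present if GraphRAG is installed.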
pip install --quiet pandas neo4j-rust-ext
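It is unclear whether the cause is the model used or GraphRAG itself, but the import that actually works differs slightly from the officially provided one, mainly in the names of some fields. Readers who already know Neo4j well can adapt the code themselves; readers who only want to see what the generated knowledge graph looks like can follow the author's modifications. That said, in the author's view the graph imported this way does not display particularly well; processing the data manually and building the graph yourself would be more accurate.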
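Since the differences are mostly in column names, it can be useful to look at which columns a Parquet file actually contains before writing the import statement. A minimal sketch, assuming pyarrow is installed; `missing_columns` and `parquet_columns` are hypothetical helper names introduced here for illustration.

```python
def missing_columns(available, expected):
    """Return the expected column names that are absent, preserving their order."""
    present = set(available)
    return [name for name in expected if name not in present]


def parquet_columns(path):
    """Read only the schema of a Parquet file and return its column names."""
    # Imported lazily so missing_columns() works even without pyarrow.
    import pyarrow.parquet as pq

    return pq.read_schema(path).names


# Example usage (path and column list are illustrative, adjust to your output folder):
# cols = parquet_columns("ragtest/output/create_final_entities.parquet")
# print(missing_columns(cols, ["id", "title", "type", "description", "text_unit_ids"]))
```

Any name this reports has to be renamed or dropped in the corresponding `columns=[...]` list, otherwise `pd.read_parquet` will raise an error.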
Import the packages
python
import time
import pandas as pd
from neo4j import GraphDatabase
Set the database parameters
python
GRAPHRAG_FOLDER = "ragtest/output"
NEO4J_URI = "neo4j://localhost" # or neo4j+s://xxxx.databases.neo4j.io
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "your password"
NEO4J_DATABASE = "neo4j"
Instantiate the Neo4j driver
python
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
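Before running any import it is worth confirming that the driver can reach the server and that the apoc plugin is loaded. The sketch below is one possible check, not part of the original workflow: `check_database` is a hypothetical helper that expects a driver object like the one created above. If apoc is not installed correctly, the `apoc.version()` call is expected to fail with an error instead of returning a version.

```python
def check_database(driver, database="neo4j"):
    """Verify the connection and return the installed apoc version string."""
    # Raises an exception if the server is unreachable or the credentials are wrong.
    driver.verify_connectivity()
    records, _, _ = driver.execute_query(
        "RETURN apoc.version() AS version",
        database_=database,
    )
    return records[0]["version"]


# Example usage with the driver and database name defined in this article:
# print(check_database(driver, NEO4J_DATABASE))
```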
Build the batch import function
python
def batched_import(statement, df, batch_size=1000):
    """
    Import a dataframe into Neo4j using a batched approach.
    Parameters: statement is the Cypher query to execute, df is the dataframe to import, and batch_size is the number of rows to import in each batch.
    """
    total = len(df)
    start_s = time.time()
    for start in range(0, total, batch_size):
        batch = df.iloc[start : min(start + batch_size, total)]
        result = driver.execute_query(
            "UNWIND $rows AS value " + statement,
            rows=batch.to_dict("records"),
            database_=NEO4J_DATABASE,
        )
        print(result.summary.counters)
    print(f"{total} rows in {time.time() - start_s} s.")
    return total
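The batching itself is just index arithmetic over the dataframe rows. As an illustration only (`batch_ranges` is a hypothetical helper, not used by the import code), the start and end positions that the loop above slices with can be listed like this:

```python
def batch_ranges(total, batch_size=1000):
    """Yield (start, end) index pairs covering range(total) in chunks of batch_size."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


print(list(batch_ranges(2500)))  # [(0, 1000), (1000, 2000), (2000, 2500)]
```

Each pair corresponds to one `driver.execute_query` call, so 2500 rows are sent in three round trips, with the last batch holding the remaining 500 rows.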
Create the constraints (an idempotent operation)
python
statements = [
    "\ncreate constraint chunk_id if not exists for (c:__Chunk__) require c.id is unique",
    "\ncreate constraint document_id if not exists for (d:__Document__) require d.id is unique",
    # Each constraint needs its own name: a reused name is skipped by "if not exists".
    "\ncreate constraint community_id if not exists for (c:__Community__) require c.community is unique",
    "\ncreate constraint entity_id if not exists for (e:__Entity__) require e.id is unique",
    "\ncreate constraint entity_title if not exists for (e:__Entity__) require e.name is unique",
    "\ncreate constraint covariate_title if not exists for (e:__Covariate__) require e.title is unique",
    "\ncreate constraint related_id if not exists for ()-[rel:RELATED]->() require rel.id is unique",
    "\n",
]
for statement in statements:
    # Skip the empty trailing entry
    if len((statement or "").strip()) > 0:
        print(statement)
        driver.execute_query(statement)
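One detail to watch with `if not exists`: when a constraint with the same name already exists, the statement is a no-op, so two statements that share a name leave the second constraint uncreated without raising an error. A small sketch for spotting this before running the statements; `duplicate_constraint_names` is a hypothetical helper, and the labels `A`, `B` and `C` in the example are made up.

```python
import re
from collections import Counter

# Captures the constraint name that follows "create constraint".
_NAME = re.compile(r"create\s+constraint\s+(\w+)", re.IGNORECASE)


def duplicate_constraint_names(statements):
    """Return the constraint names that appear in more than one statement."""
    names = [m.group(1) for s in statements if (m := _NAME.search(s))]
    return sorted(name for name, count in Counter(names).items() if count > 1)


example = [
    "create constraint a_id if not exists for (a:A) require a.id is unique",
    "create constraint a_id if not exists for (b:B) require b.id is unique",
    "create constraint c_id if not exists for (c:C) require c.id is unique",
]
print(duplicate_constraint_names(example))  # ['a_id']
```

Passing the `statements` list from this section to the function shows whether any names collide.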
Import create_final_documents.parquet
python
doc_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_documents.parquet", columns=["id", "title"]
)
# Import documents
statement = """
MERGE (d:__Document__ {id:value.id})
SET d += value {.title}
"""
batched_import(statement, doc_df)
Import create_final_text_units.parquet
python
text_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_text_units.parquet",
    columns=["id", "text", "n_tokens", "document_ids"],
)
statement = """
MERGE (c:__Chunk__ {id:value.id})
SET c += value {.text, .n_tokens}
WITH c, value
UNWIND value.document_ids AS document
MATCH (d:__Document__ {id:document})
MERGE (c)-[:PART_OF]->(d)
"""
batched_import(statement, text_df)
Import create_final_entities.parquet
python
entity_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_entities.parquet",
    columns=[
        "title",
        "type",
        "description",
        "human_readable_id",
        "id",
        # "description_embedding",
        "text_unit_ids",
    ],
)
entity_df.rename(columns={"title": "name"}, inplace=True)
entity_statement = """
MERGE (e:__Entity__ {id:value.id})
SET e += value {.human_readable_id, .description, name:replace(value.name,'"','')}
WITH e, value
CALL apoc.create.addLabels(e, case when coalesce(value.type,"") = "" then [] else [apoc.text.upperCamelCase(replace(value.type,'"',''))] end) yield node
UNWIND value.text_unit_ids AS text_unit
MATCH (c:__Chunk__ {id:text_unit})
MERGE (c)-[:HAS_ENTITY]->(e)
"""
batched_import(entity_statement, entity_df)
Import create_final_relationships.parquet
python
rel_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_relationships.parquet",
    columns=[
        "source",
        "target",
        "id",
        # "rank",
        "weight",
        "human_readable_id",
        "description",
        "text_unit_ids",
    ],
)
rel_statement = """
MATCH (source:__Entity__ {name:replace(value.source,'"','')})
MATCH (target:__Entity__ {name:replace(value.target,'"','')})
// not necessary to merge on id as there is only one relationship per pair
MERGE (source)-[rel:RELATED {id: value.id}]->(target)
SET rel += value {.weight, .human_readable_id, .description, .text_unit_ids}
RETURN count(*) as createdRels
"""
batched_import(rel_statement, rel_df)
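Because the relationship statement looks up both endpoints with MATCH, a row whose source or target name has no matching `__Entity__` node is skipped without an error. The following sketch lists such rows ahead of the import. It is an illustration under assumptions: `strip_quotes` and `unmatched_relationships` are hypothetical helpers, and the comparison only mirrors the quote stripping done in the Cypher above.

```python
def strip_quotes(name):
    """Mirror the Cypher replace(value, '"', '') used in the import statements."""
    return name.replace('"', "")


def unmatched_relationships(entity_names, relationships):
    """Return the relationships whose source or target has no matching entity name."""
    known = {strip_quotes(name) for name in entity_names}
    return [
        rel
        for rel in relationships
        if strip_quotes(rel["source"]) not in known
        or strip_quotes(rel["target"]) not in known
    ]


# Example usage with the dataframes from this article:
# skipped = unmatched_relationships(entity_df["name"], rel_df.to_dict("records"))
# print(len(skipped), "relationships would not be created")
```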
Import create_final_communities.parquet
python
community_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_communities.parquet",
    columns=["id", "level", "title", "text_unit_ids", "relationship_ids"],
)
statement = """
MERGE (c:__Community__ {community:value.id})
SET c += value {.level, .title}
/*
UNWIND value.text_unit_ids as text_unit_id
MATCH (t:__Chunk__ {id:text_unit_id})
MERGE (c)-[:HAS_CHUNK]->(t)
WITH distinct c, value
*/
WITH *
UNWIND value.relationship_ids as rel_id
MATCH (start:__Entity__)-[:RELATED {id:rel_id}]->(end:__Entity__)
MERGE (start)-[:IN_COMMUNITY]->(c)
MERGE (end)-[:IN_COMMUNITY]->(c)
RETURN count(distinct c) as createdCommunities
"""
batched_import(statement, community_df)
Import create_final_community_reports.parquet
python
community_report_df = pd.read_parquet(
    f"{GRAPHRAG_FOLDER}/create_final_community_reports.parquet",
    columns=[
        "id",
        "community",
        "level",
        "title",
        "summary",
        "findings",
        "rank",
        "rank_explanation",
        "full_content",
    ],
)
# Import communities
community_statement = """
MERGE (c:__Community__ {community:value.community})
SET c += value {.level, .title, .rank, .rank_explanation, .full_content, .summary}
WITH c, value
UNWIND range(0, size(value.findings)-1) AS finding_idx
WITH c, value, finding_idx, value.findings[finding_idx] as finding
MERGE (c)-[:HAS_FINDING]->(f:Finding {id:finding_idx})
SET f += finding
"""
batched_import(community_statement, community_report_df)
Import create_final_nodes.parquet
python
nodes_df = pd.read_parquet(f"{GRAPHRAG_FOLDER}/create_final_nodes.parquet")
nodes_statement = """
MERGE (c:__Covariate__ {id:value.id})
SET c += apoc.map.clean(value, ["text_unit_id", "document_ids", "n_tokens"], [NULL, ""])
WITH c, value
MATCH (ch:__Chunk__ {id: value.text_unit_id})
MERGE (ch)-[:HAS_COVARIATE]->(c)
"""
batched_import(nodes_statement, nodes_df)
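A quick way to confirm that the imports did something is to count nodes per label. This is a minimal sketch rather than part of the original workflow: `label_counts` is a hypothetical helper that takes a driver like the one created earlier. Entities also carry a label derived from their type, so those labels appear in the result as well.

```python
def label_counts(driver, database="neo4j"):
    """Return a dict mapping each node label to the number of nodes carrying it."""
    records, _, _ = driver.execute_query(
        "MATCH (n) UNWIND labels(n) AS label "
        "RETURN label, count(*) AS count ORDER BY label",
        database_=database,
    )
    return {record["label"]: record["count"] for record in records}


# Example usage with the driver and database name defined in this article:
# for label, count in label_counts(driver, NEO4J_DATABASE).items():
#     print(f"{label}: {count}")
```

If a label such as `__Entity__` or `__Chunk__` is missing from the output, the corresponding import step matched no rows and is worth rerunning with the column names checked.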
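Viewing the knowledge graph
After starting Neo4j, visit http://localhost:7474
The result is reasonable. However, probably because a local model with weaker reasoning ability was used, some relationships between entities were not sorted out properly, and the knowledge graph data needs some manual cleanup. From a data-preparation point of view, if a full knowledge graph is not required, searching by event and entity should still retrieve all the relevant information. Overall, the current approach is only barely satisfactory.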