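Installation
Install Python 3.12.
Create a new project in PyCharm. This makes debugging easier and lets you add code to the installed dependency packages to pinpoint errors.
Install graphrag version 1.2.0.
In a terminal, run:
pip install graphrag==1.2.0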
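As a quick sanity check after installation, a small script can report the interpreter version and the installed graphrag version. This is only a sketch: the helper names `installed_version` and `environment_report` are made up for illustration, and it relies solely on the standard library.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(package: str = "graphrag", expected: str = "1.2.0") -> str:
    """Summarize the Python version and whether `package` matches `expected`."""
    py = f"{sys.version_info.major}.{sys.version_info.minor}"
    found = installed_version(package)
    if found is None:
        return f"Python {py}: {package} is not installed"
    status = "ok" if found == expected else f"expected {expected}"
    return f"Python {py}: {package} {found} ({status})"
```

Calling `print(environment_report())` inside the PyCharm project interpreter shows whether the pinned version was actually installed there.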
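Create a directory named ragtest.
Then create a subdirectory named input under it and copy some files into this input folder. These files make up your private knowledge base.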
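The directory setup can also be scripted. The sketch below assumes the knowledge-base files are plain .txt files (matching the file_pattern used later in the configuration); `prepare_workspace` and its parameters are hypothetical names, not part of graphrag.

```python
import shutil
from pathlib import Path
from typing import List


def prepare_workspace(root: Path, source_dir: Path, pattern: str = "*.txt") -> List[Path]:
    """Create <root>/input and copy files matching `pattern` from source_dir into it."""
    input_dir = root / "input"
    input_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(source_dir.glob(pattern)):
        if src.is_file():
            # copy2 keeps timestamps and returns the destination path
            copied.append(Path(shutil.copy2(src, input_dir / src.name)))
    return copied
```

For example, `prepare_workspace(Path("ragtest"), Path("my_documents"))` would create ragtest/input and copy every .txt file from a (hypothetical) my_documents folder into it.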
Initialization
graphrag init --root ./ragtest
Edit the configuration files
Edit the .env file and replace the placeholder with your own key: GRAPHRAG_API_KEY=sk-XXX
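A forgotten placeholder key is an easy mistake to make, so it can help to check the .env file before indexing. The following is a minimal sketch, assuming a simple KEY=VALUE file; the helper names are illustrative, and the placeholder values it rejects (`<API_KEY>` and `sk-XXX`) are assumptions about what an unedited file might contain.

```python
from pathlib import Path
from typing import Dict


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_set(path: Path, name: str = "GRAPHRAG_API_KEY") -> bool:
    """Return True if `name` has a value that is not an obvious placeholder."""
    value = read_env_file(path).get(name, "")
    # Placeholder values below are assumptions, not an exhaustive list.
    return bool(value) and value not in {"<API_KEY>", "sk-XXX"}
```

Running `api_key_is_set(Path("ragtest/.env"))` before `graphrag index` gives an early warning if the key was never filled in.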
Edit settings.yaml
yaml
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!

llm:
  api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: deepseek-ai/DeepSeek-V3
  api_base: https://api.siliconflow.cn/v1
  model_supports_json: false # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store:
    type: lancedb # one of [lancedb, azure_ai_search, cosmosdb]
    db_uri: 'output\lancedb'
    collection_name: default
    overwrite: true
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: BAAI/bge-m3
    api_base: https://api.siliconflow.cn/v1
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 300
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # one of [blob, cosmosdb, file]
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "logs"

storage:
  type: file # one of [blob, cosmosdb, file]
  base_dir: "output"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 3000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: false
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  prompt: "prompts/basic_search_system_prompt.txt"
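In the prompt file (this sentence appears in the default prompts/community_report.txt), change the JSON output instruction to the following line:
Return output as a well-formed JSON-formatted string with the following format, but don't output in markdown format, the output string should be directly usable by json.load():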
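The reason for this prompt change is easier to see with a small example. When a model wraps its JSON reply in a Markdown code fence, a plain JSON parse fails. The sketch below is not GraphRAG's own parsing code; `parse_model_json` is a hypothetical helper that shows the failure mode and one defensive workaround.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

# Matches a reply that consists of exactly one fenced block, with an optional language tag.
_FENCED_REPLY = re.compile(
    r"^" + FENCE + r"[a-zA-Z0-9_-]*\s*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_model_json(reply: str) -> Any:
    """Parse a model reply as JSON, removing a surrounding Markdown fence if present."""
    text = reply.strip()
    match = _FENCED_REPLY.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)
```

A reply such as `{"title": "x"}` parses directly, while the same content wrapped in a fence would raise `json.JSONDecodeError` without the stripping step. Asking the model not to use Markdown avoids the problem at the source.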
Build the index
graphrag index --root ./ragtest
Query test
graphrag query --root ./ragtest --method global --query "虚竹是谁?"
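Prompt generation
graphrag prompt-tune --root .\ragtest --domain 小说 --language 中文 --chunk-size 200 --output .\prompt-cs
This generates prompts under .\prompt-cs that are tailored to the novel domain (小说) and to Chinese text (中文). You can then replace the default prompts under .\ragtest\prompts with the newly generated ones and rebuild the index.
Overwriting them directly is not recommended, though. The prompts are generated by a model and may contain problems, so review them first.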
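One cautious way to handle the generated prompts is to diff them against the current ones and keep a backup before copying anything. The following is a sketch under the assumption that both folders hold .txt prompt files with matching names; `diff_prompts` and `install_prompts` are illustrative helpers, not graphrag commands.

```python
import difflib
import shutil
from pathlib import Path
from typing import Dict, List


def diff_prompts(default_dir: Path, generated_dir: Path) -> Dict[str, List[str]]:
    """Return a unified diff for each generated prompt that has a same-named default."""
    diffs = {}
    for new_file in sorted(generated_dir.glob("*.txt")):
        old_file = default_dir / new_file.name
        if not old_file.exists():
            continue
        old = old_file.read_text(encoding="utf-8").splitlines()
        new = new_file.read_text(encoding="utf-8").splitlines()
        diffs[new_file.name] = list(
            difflib.unified_diff(old, new, str(old_file), str(new_file), lineterm="")
        )
    return diffs


def install_prompts(default_dir: Path, generated_dir: Path, backup_dir: Path) -> List[str]:
    """Back up the current prompts, then copy the generated ones over them."""
    shutil.copytree(default_dir, backup_dir)  # raises if backup_dir already exists
    replaced = []
    for new_file in sorted(generated_dir.glob("*.txt")):
        shutil.copy2(new_file, default_dir / new_file.name)
        replaced.append(new_file.name)
    return replaced
```

Reading the diffs first makes it easier to spot a generated prompt that dropped the required output format or placeholders. If indexing fails afterwards, the backup folder can be copied back.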
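Possible errors
❌ create_final_community_reports: adjust the model and its parameters in settings.yaml. Use the log files to locate the failure, then tune the parameters accordingly.

❌ generate_text_embeddings:
Possible causes
- Improper model parameter configuration: a parameter such as max_input_tokens may be set too high or too low, so the model cannot produce valid JSON output.
- Complex or mismatched input: the input may be too complex, or may not match the input format the model expects.
- Network problems or failed API calls: an unstable network or a failed API call can prevent the model from responding normally.
- Insufficient model capability: the model may be unable to handle certain inputs, so its output does not match the expected format.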
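Since most of these causes only show up in the logs, a short script can pull out the relevant lines. This sketch assumes the logs live under ragtest/logs (the reporting base_dir in the configuration above) and that the files end in .log; both the file extension and the keyword list are assumptions, and `find_error_lines` is a hypothetical helper.

```python
from pathlib import Path
from typing import List, Sequence, Tuple


def find_error_lines(
    log_dir: Path,
    keywords: Sequence[str] = ("error", "exception", "traceback"),
) -> List[Tuple[str, int, str]]:
    """Return (file name, line number, text) for log lines containing any keyword."""
    hits = []
    for log_file in sorted(log_dir.glob("*.log")):
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(word in lowered for word in keywords):
                    hits.append((log_file.name, number, line.rstrip()))
    return hits
```

For example, `find_error_lines(Path("ragtest/logs"))` lists candidate lines to inspect. A rate-limit or timeout message points to the network or API, while a JSON decoding error points to the model output or prompt.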