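Hands-On Walkthrough of RAKG, a Document-Level Knowledge Graph Construction Framework on GitHub (Step by Step), Day 4

Reading `kgAgent.py`

With text vectorization out of the way, we reach the actual knowledge extraction step. In this project it is implemented mainly by the NER_Agent class in kgAgent.py, which covers initialization, entity extraction, entity disambiguation, and graph construction. This installment focuses on entity extraction and entity disambiguation: it walks through the code in detail, runs test scripts against both stages, and summarizes each one for developers to reference.

Start with the imports: the file pulls in prompt and llm_provider from the src directory.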


# kgAgent.py
from langchain_core.prompts import ChatPromptTemplate
from src.prompt import text2entity_cn    # switched to text2entity_cn so extraction uses the Chinese prompt
from src.prompt import extract_entiry_centric_kg_en_v2
from src.prompt import judge_sim_entity_en
from itertools import combinations
import re
import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from src.llm_provider import LLMProvider

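As usual, we follow the imports outward and open prompt.py to see what is actually being imported (llm_provider was covered in Day 2; refer back to it for details). In prompt.py, the text2entity prompts (English and Chinese versions) look like this: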


text2entity_en = """
You are a named entity recognition assistant responsible for identifying named entities from the given text.
Text: {text}
Notes:
1. First, you should determine whether the text contains any information. If it's just meaningless symbols, directly output: {{State : False}}. If the text contains information, proceed to the next step.
2. You should consider the entire text for named entity recognition.
3. The identified entities should consist of three parts: name, type, and description.
    - name: The main subject of the named entity.
    - type: The category of the subject.
    - description: A summary description of the subject, explaining what it is.
4. Since multiple named entities may be identified in a single text, you need to output them in a specific format.
    The output format should be:
    {{
        "entity1": {{
            "name": "Entity Name 1",
            "type": "Entity Type 1",
            "description": "Entity Description 1"
        }},
        "entity2": {{
            "name": "Entity Name 2",
            "type": "Entity Type 2",
            "description": "Entity Description 2"
        }},
        ...
        "entityn": {{
            "name": "Entity Name n",
            "type": "Entity Type n",
            "description": "Entity Description n"
        }}
    }}
"""

text2entity_cn = """
你是一个命名实体识别助手，负责从给定的文本中识别命名实体
文本：{text}
注意：
1、首先你应当判断这个文本是否包含任何信息，如果只是没有意义的符号，那么直接输出:{{State : Fasle}}，如果文本中存在着信息，那么进行下一步
2、你要综合考虑整个文本进行命名实体识别
3、命名的实体需要由三个部分组成分别是：name、type、description
    name是命名实体的主体
    type是这个主体的类别
    description是这个主体的描述，概括性的描述这个主体是什么
4、由于一段文本中可能识别出多个命名实体，所以你需要以一定的格式来输出
    输出的格式为：
    {{
        "entity1":{{
            "name": "实体名称1",
            "type": "实体类型1",
            "description": "实体描述1"
        }},
        "entity2":{{
            "name": "实体名称2",
            "type": "实体类型2",
            "description": "实体描述2"
        }},
        ...
        "entityn":{{
            "name": "实体名称n",
            "type": "实体类型n",
            "description": "实体描述n"
        }}
    }}
"""

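This prompt gives the model its instructions in three parts: a role, the task, and a required output format. That structure is worth borrowing. With it, the model can identify and extract the entities.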
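
One detail of these prompts is easy to miss: the single-brace {text} is a placeholder, while the doubled braces around the JSON sample are escapes that render as literal braces. The sketch below illustrates this with plain str.format, on the assumption that the default template style used by ChatPromptTemplate.from_template follows the same brace rules. The template text here is a shortened stand-in, not the project's prompt.

```python
# A shortened, illustrative stand-in for text2entity_cn (not the project's wording).
demo_template = """Identify named entities in the text.
Text: {text}
Output format:
{{
    "entity1": {{"name": "...", "type": "...", "description": "..."}}
}}
"""

# {text} is substituted; {{ and }} collapse to literal { and }.
rendered = demo_template.format(text="Apple released Vision Pro.")
print(rendered)
```

If the JSON sample were written with single braces, formatting would fail because the formatter would try to treat the sample as placeholders.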


extract_entiry_centric_kg_en_v2 = """
You are a knowledge graph extraction assistant, responsible for extracting attributes and relationships related to a specified entity from the text, in combination with other relevant knowledge graphs.
Text: {text}
Target Entity: {target_entity}
Related Knowledge Graphs: {related_kg}
Requirements for you:
1. You should integrate the entire text to comprehensively extract relationships related to the specified entity and build a sub-graph for the specified entity.
............................................................(rest of the prompt omitted here)
"""

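The extract_entiry_centric_kg_en_v2 prompt instructs the model to build the knowledge graph. Graph construction will be covered in a later installment; this one sticks to entity extraction and entity disambiguation.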


judge_sim_entity_en = """
    You are a knowledge graph entity disambiguation assistant responsible for determining whether two entities are essentially the same entity. For example:
    Entity 1: "name": "Henan Business Daily", "type": "Media Organization", "description": "A commercial newspaper in Henan Province that provides news and information reporting." and Entity 2: "name": "Top News·Henan Business Daily", "type": "Organization Name", "description": "A news media organization located in Henan Province, responsible for reporting important local and national news and information."
    Essentially, they are the same entity.
    Entity 1: {entity1}
    Entity 2: {entity2}
    Notes:
    1. You should initially judge whether the two entities might be the same based on their names and types, and if they might be the same, analyze their descriptions in detail to determine if they are indeed the same.
    2. Your output format should be "yes" if you determine that they are the same entity, outputting: {{'result': True}}, and if you determine that they are not the same entity, outputting: {{'result': False}}.
"""

judge_sim_entity_cn = '''
    你是一个知识图谱实体消歧助手，负责判断两个实体本质上是否是同一个实体，例如：
    实体1："name": "河南商报", "type": "媒体机构", "description": "河南省的一家商业报纸，提供新闻和信息报道。"和实体2："name": "顶端新闻·河南商报", "type": "组织名", "description": "一家位于河南省的新闻媒体机构，负责报道地方及全国的重要新闻和信息。
    并且，不同实体，实体的复数、不同时态，都视为同一实体
    本质上是同一个实体
    实体1:{entity1}
    实体2:{entity2}
    注意：
    1、你应当通过name和type大体判断这两个实体是否可能相同，并且在可能相同的情况下，通过description具体分析是否相同
    2、你的输出格式应当为是，当判断确实是同一个实体时输出:{{'result':True}}，判断不是同一个实体时输出:{{'result':False}}
'''

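The judge_sim_entity_en prompt instructs the model to carry out entity disambiguation and is used in the disambiguation stage.

Overall, the author keeps every prompt that kgAgent needs in a separate file, prompt.py, and references them from the methods of the NER_Agent class. This keeps the main program lean: long prompt strings do not clutter the logic or hurt readability. It is a pattern worth reusing.

With that detour done, back to kgAgent.py, starting with entity extraction: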


class NER_Agent():
    def __init__(self):
        self.llm_provider = LLMProvider()
        self.model = self.llm_provider.get_llm()
        self.similarity_model = self.llm_provider.get_similarity_model()
        self.embeddings = self.llm_provider.get_embedding_model()

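This sets up the models the agent uses: the main LLM, the similarity model, and the embedding model. get_llm, get_similarity_model, and get_embedding_model are all methods of the LLMProvider class.

The code below is the core of entity extraction: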


def extract_from_text_single(self, text_single, output_file):
    # Step 1: prepare the prompt template.
    # ChatPromptTemplate.from_template loads the text2entity_cn string template,
    # which contains a {text} placeholder waiting to be filled in.
    prompt = ChatPromptTemplate.from_template(text2entity_cn)
    # Step 2: build the chain.
    # The `|` operator comes from LCEL (LangChain Expression Language). It works like a pipe:
    # the prompt formats the input, and the formatted result is passed on to self.model,
    # which is whatever get_llm() returned.
    chain = prompt | self.model
    # Step 3: run the chain.
    # chain.invoke() receives the dict {"text": text_single}. LangChain substitutes
    # text_single for the {text} placeholder and sends the complete prompt to the LLM.
    # result is the raw response from the LLM.
    result = chain.invoke({"text": text_single})

    # Step 4: parse and normalize the result.
    # Depending on the model wrapper, the return value can differ in shape.
    # hasattr(result, 'content') checks for a message object with a .content attribute
    # (for example an AIMessage from a chat model); the JSON string then lives in result.content.
    if hasattr(result, 'content'):
        # json.loads() turns a JSON string into Python data (a dict or a list).
        result_json = json.loads(result.content)
    else:
        # Without a .content attribute, result is assumed to be the JSON string itself.
        result_json = json.loads(result)

    # Step 5: log the result.
    # Store text_single and result_json in a jsonl file.
    # `with open(...) as f:` closes the file even if an error occurs inside the block.
    # 'a' is append mode: every call adds one line at the end of the file
    # without overwriting earlier lines, which suits a log file.
    with open(output_file, 'a') as f:
        # Bundle the input text and the entities returned by the LLM into one record.
        combined_data = {
            "text": text_single,
            "entities": result_json
        }
        # json.dumps() serializes the dict back to a JSON string; the trailing '\n'
        # keeps one JSON object per line, i.e. the .jsonl (JSON Lines) format.
        # To keep Chinese text readable in the log, use
        # f.write(json.dumps(combined_data, ensure_ascii=False) + '\n') instead.
        f.write(json.dumps(combined_data) + '\n')
    # Step 6: return the parsed result (a Python dict or list).
    # In this project the return value feeds into `extract_from_text_multiply` for the next step.
    return result_json

Now that each step is clear, a small test script of our own shows the extraction process in action:


# test_kg_agent_stage1.py

# Import the agent under test
from src.kgAgent import NER_Agent
import json

# A sample text that mimics a chunk the system would process
sample_text = """
苹果公司（Apple Inc.）由史蒂夫·乔布斯创立，是一家总部位于库比蒂诺的科技巨头。
最近，苹果发布了全新的 Vision Pro，这是一款混合现实头戴设备。
蒂姆·库克是该公司现任的首席执行官。
"""

# File the agent appends its raw output to.
# Windows-style path written as a raw string so the backslash is taken literally;
# on Windows the folder must already exist.
output_log_file = r"test_kg_agent_stage1_2_3_4\stage1_ner_output.jsonl"

print("--- 第一阶段：初始实体提取测试 ---")

# 1. Initialize NER_Agent
# This loads the LLM and the embedding model as configured in llm_provider.py
print("正在初始化 NER_Agent...")
agent = NER_Agent()     # NER_Agent is a class
print("Agent 初始化完成。")

# 2. Call the single-text extraction method
# This is the core function of the first pipeline stage
print(f"\n正在从以下文本中提取实体:\n---\n{sample_text}\n---")

try:
    # This call talks to the LLM using the 'text2entity_cn' prompt
    extracted_entities = agent.extract_from_text_single(sample_text, output_log_file)
    # Two things to read alongside this call: extract_from_text_single in
    # kgAgent.NER_Agent, and the prompt it uses

    # 3. Inspect the result
    print("\n--- 提取完成! ---")
    print("来自 LLM 的原始 JSON 输出:")
    # json.dumps pretty-prints the dict
    print(json.dumps(extracted_entities, indent=2, ensure_ascii=False))

    print(f"\n注意: 详细日志已保存至 '{output_log_file}'")

except Exception as e:
    print(f"\n--- 发生错误 ---")
    print(f"错误详情: {e}")
    print("请检查你的 API Key、base_url 以及模型是否可用。")

The output is as follows:


--- 第一阶段：初始实体提取测试 ---
正在初始化 NER_Agent...
Agent 初始化完成。

正在从以下文本中提取实体:
---

苹果公司（Apple Inc.）由史蒂夫·乔布斯创立，是一家总部位于库比蒂诺的科技巨头。
最近，苹果发布了全新的 Vision Pro，这是一款混合现实头戴设备。
蒂姆·库克是该公司现任的首席执行官。

---

--- 提取完成! ---
来自 LLM 的原始 JSON 输出:
{
  "entity1": {
    "name": "苹果公司",
    "type": "Company",
    "description": "一家总部位于库比蒂诺的科技巨头"
  },
  "entity2": {
    "name": "Apple Inc.",
    "type": "Company",
    "description": "苹果公司的英文名称"
  },
  "entity3": {
    "name": "史蒂夫·乔布斯",
    "type": "Person",
    "description": "苹果公司创始人"
  },
  "entity4": {
    "name": "库比蒂诺",
    "type": "Location",
    "description": "苹果公司的总部所在地"
  },
  "entity5": {
    "name": "Vision Pro",
    "type": "Product",
    "description": "苹果公司发布的一款混合现实头戴设备"
  },
  "entity6": {
    "name": "蒂姆·库克",
    "type": "Person",
    "description": "苹果公司的现任首席执行官"
  }
}

注意: 详细日志已保存至 'test_kg_agent_stage1_2_3_4\stage1_ner_output.jsonl'

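In summary, judging from the test script, entity extraction relies on the extract_from_text_single method of NER_Agent. The test itself does two things: it prepares the text and calls extract_from_text_single to get the result. That method in turn has four parts: building the prompt, calling the model through a LangChain chain, normalizing the result, and logging it. These nested layers take a moment to follow, but they do simplify the code and make it quicker to write.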
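
The logging step is easy to check in isolation. The following is a minimal sketch of the JSON Lines round trip, written independently of the project: append_record and read_records are hypothetical helper names, the file goes to a temporary directory, and ensure_ascii=False plus an explicit UTF-8 encoding are choices made here so that Chinese text stays readable in the log.

```python
import json
import os
import tempfile

def append_record(path, text, entities):
    # Hypothetical helper: one JSON object per line, appended to the file.
    record = {"text": text, "entities": entities}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_records(path):
    # Hypothetical helper: read the log back line by line, skipping blank lines.
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

log_path = os.path.join(tempfile.mkdtemp(), "demo_ner_output.jsonl")
append_record(log_path, "苹果发布了 Vision Pro。",
              {"entity1": {"name": "Vision Pro", "type": "Product", "description": "混合现实头戴设备"}})
append_record(log_path, "蒂姆·库克是首席执行官。",
              {"entity1": {"name": "蒂姆·库克", "type": "Person", "description": "首席执行官"}})

records = read_records(log_path)
print(len(records), records[0]["entities"]["entity1"]["name"])
```

Because each line is a complete JSON object, a later stage can process the log one record at a time without loading the whole file.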

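Next comes entity disambiguation, which runs in five steps: assign chunk IDs, coarse-filter similar entities (candidate mode), fine-filter with the LLM, collect the confirmed pairs, and merge the entities. The first piece of code adds the chunk ID to each entity: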


## Add chunkid attribute
# A method of the NER_Agent class. Parameters:
# self: the NER_Agent instance.
# ner_result: a dict holding all entities extracted from a single text chunk.
# chunkid: the unique identifier of that chunk. Use a string: the merge step
#          later joins chunk IDs with str.join, which rejects numbers.
def add_chunkid(self, ner_result, chunkid):

    # Step 1: create a new outer dict for the result.
    # Only the outer dict is new; the inner entity dicts are shared with ner_result.
    new_ner_result = {}

    # Step 2: iterate over the entities.
    # ner_result.items() yields key and value together. On the first iteration:
    # entity_key is "entity1"
    # entity_value is {"name": "苹果公司", "type": "Company", ...}
    for entity_key, entity_value in ner_result.items():
        # Step 3: tag the entity with its source.
        # This is the core operation: a "chunkid" key is added to the entity dict in place,
        # giving {"name": "苹果公司", "type": "Company", ..., "chunkid": "doc1_chunk1"}.
        # Because the dict is modified in place, the entities in ner_result gain the key too.
        entity_value["chunkid"] = chunkid

        # Step 4: store the tagged entity under its original key.
        new_ner_result[entity_key] = entity_value

    # Step 5: return the dict of tagged entities.
    return new_ner_result
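
The sharing of inner dicts is worth seeing once. In the sketch below, add_chunkid_inplace repeats the logic of add_chunkid as a plain function, and add_chunkid_copy is a hypothetical alternative (not part of the project) that copies each entity before tagging it, which may be preferable if the untagged input is still needed afterwards.

```python
import copy

def add_chunkid_inplace(ner_result, chunkid):
    # Same logic as add_chunkid above, written as a plain function.
    new_ner_result = {}
    for key, value in ner_result.items():
        value["chunkid"] = chunkid      # mutates the caller's entity dict
        new_ner_result[key] = value
    return new_ner_result

def add_chunkid_copy(ner_result, chunkid):
    # Hypothetical variant: copy each entity first, so the input stays untouched.
    return {
        key: {**copy.deepcopy(value), "chunkid": chunkid}
        for key, value in ner_result.items()
    }

original = {"entity1": {"name": "苹果公司", "type": "Company"}}

tagged = add_chunkid_copy(original, "doc1_chunk1")
print("chunkid" in original["entity1"])   # False: the input was not modified

add_chunkid_inplace(original, "doc1_chunk1")
print("chunkid" in original["entity1"])   # True: the input now carries the tag
```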

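The coarse filter comes next, implemented by similarity_candidates. It uses embedding similarity with a threshold to pick out the most suspicious entity pairs, which is far faster than analyzing every pair with the LLM. Why not let the LLM judge all pairs directly?
- With 100 entities, the number of pairs to judge is C(100,2) = 4,950.
- Each LLM call is expensive in both time and money, so 4,950 calls would be very costly.
- The function uses cheap vector computation to discard the clearly unrelated pairs first and passes only the suspicious remainder to the expensive LLM. How many pairs survive depends on the data and the threshold.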
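
The pair count above grows quadratically, which a few lines of standard-library Python make concrete. pair_count is a name introduced here for illustration only.

```python
from itertools import combinations
from math import comb

def pair_count(n):
    # Number of unordered pairs among n entities: C(n, 2) = n * (n - 1) / 2
    return comb(n, 2)

for n in (6, 100, 1000):
    print(n, pair_count(n))

# The same count, obtained by actually enumerating the pairs for six entities.
keys = [f"entity{i}" for i in range(1, 7)]
pairs = list(combinations(keys, 2))
print(len(pairs))
```

Ten times as many entities means roughly a hundred times as many pairs, which is why a cheap pre-filter pays off quickly.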


def similarity_candidates(self, entities, threshold=0.60):
    # threshold=0.60 is an empirical value: only entity pairs whose vector similarity
    # exceeds 0.6 are worth a closer look. It trades recall against precision:
    # too high (e.g. 0.9): pairs that should be merged may be missed (low recall)
    # too low (e.g. 0.3): too much noise reaches the LLM stage (low precision)
    def get_embedding_vector(text):
        # embed_documents() takes a list of texts and returns one embedding vector per text;
        # embed_query() is its counterpart for a single query string.
        result = self.embeddings.embed_documents([text])
        # Smooth over differences in return shape between embedding backends: unwrap [[...]] to [...]
        if isinstance(result, list) and isinstance(result[0], list):
            return result[0]
        else:
            return result

    # --- Step 1: build a short descriptive text for each entity ---
    # The name alone does not tell the embedding model enough, so `name` and `type` are concatenated.
    # "苹果公司" becomes "苹果公司 Company", which lets "苹果 Company" and "苹果 Fruit"
    # end up far apart semantically.
    # Input:
    # entities = {
    #     "entity1": {"name": "苹果", "type": "公司"},
    #     "entity2": {"name": "Apple", "type": "科技企业"},
    #     "entity3": {"name": "苹果", "type": "水果"}
    # }
    #
    # Output:
    # entity_texts = {
    #     "entity1": "苹果 公司",
    #     "entity2": "Apple 科技企业",
    #     "entity3": "苹果 水果"
    # }
    entity_texts = {
        k: f"{v['name']} {v['type']}"
        for k, v in entities.items()
    }
    # --- Step 2: embed each "name type" text, one call per entity ---
    # vectors = {"entity1": [0.1, 0.2, ...], "entity2": [0.3, 0.4, ...], ...}
    # The vectors are high-dimensional; the exact size depends on the embedding model.
    vectors = {
        k: get_embedding_vector(text)
        for k, text in entity_texts.items()
    }

    # --- Step 3: pairwise cosine similarity ---
    keys = list(vectors.keys())
    sim_matrix = np.zeros((len(keys), len(keys)))
    # Cosine similarity is the cosine of the angle between two vectors: it compares direction
    # and ignores magnitude, which suits text embeddings. Its range is [-1, 1]:
    # 1 means the same direction, 0 unrelated, -1 opposite.
    # combinations yields only (i, j) with i < j, so each pair is computed once and
    # only the upper triangle of sim_matrix is filled.
    for i, j in combinations(range(len(keys)), 2):
        sim = cosine_similarity([vectors[keys[i]]], [vectors[keys[j]]])    # returns a (1, 1) array
        # print(f"Similarity between {keys[i]} and {keys[j]}: {sim}")
        # Newer NumPy releases may warn about this array-to-scalar assignment; sim[0][0] avoids it.
        sim_matrix[i][j] = sim

    # --- Step 4: collect the pairs above the threshold ---
    # np.where(sim_matrix > threshold) returns two arrays, row indices and column indices,
    # e.g. (array([0, 1]), array([2, 3])) means positions (0, 2) and (1, 3) qualify.
    # zip(*np.where(...)) pairs them up into coordinates: [(0, 2), (1, 3)].
    # The list comprehension maps the numeric indices back to entity keys via keys[i] and keys[j].
    candidates = [
        (keys[i], keys[j])
        for i, j in zip(*np.where(sim_matrix > threshold))
    ]
    return candidates

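Note 1: what get_embedding_vector(text) is for. Different embedding models or LangChain versions may return different shapes: some return [[0.1, 0.2, 0.3, ...]] (a nested list), others [0.1, 0.2, 0.3, ...] (a flat list). The helper is defensive programming: whatever the shape, the caller ends up with a flat vector.

Note 2: what keys = list(vectors.keys()) and sim_matrix = np.zeros((len(keys), len(keys))) are for. keys fixes an order so that matrix indices can be mapped back to entity keys. Similarity is symmetric, so with 5 entities only C(5,2) = 10 pairs need computing, whereas a naive double loop would run 5×5 = 25 times, 15 of them duplicates or self-comparisons. combinations visits each pair exactly once and writes it into the upper triangle of sim_matrix; every other cell stays 0 and therefore stays below any positive threshold.

Note 3: Python dict comprehension syntax: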


new_dict = {key_expression: value_expression for item in iterable}

# As written in this code
entity_texts = {
    k: f"{v['name']} {v['type']}"
    for k, v in entities.items()
}

# Traditional loop form
entity_texts = {}
for k, v in entities.items():
    entity_texts[k] = f"{v['name']} {v['type']}"

# Input and result side by side
entities = {
    "entity1": {"name": "苹果公司", "type": "Company"},
    "entity2": {"name": "乔布斯", "type": "Person"}
}
entity_texts = {
    "entity1": "苹果公司 Company",
    "entity2": "乔布斯 Person"
}

k is the key of the new dict, f"{v['name']} {v['type']}" is its value, and for k, v in entities.items() iterates over the original dict.
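
Returning to the coarse filter itself, its logic can be tried without any embedding model. The sketch below is a simplified stand-in, not the project's code: it uses hand-written 3-dimensional toy vectors in place of real embeddings and a pure-Python cosine function in place of scikit-learn's cosine_similarity, while keeping the combinations-plus-threshold structure. coarse_candidates and toy_vectors are names made up for this example.

```python
from itertools import combinations
from math import sqrt

def cosine(u, v):
    # Cosine of the angle between u and v: dot product divided by the product of norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coarse_candidates(vectors, threshold=0.60):
    # Visit each unordered pair once and keep those above the threshold.
    keys = list(vectors)
    return [
        (keys[i], keys[j])
        for i, j in combinations(range(len(keys)), 2)
        if cosine(vectors[keys[i]], vectors[keys[j]]) > threshold
    ]

# Toy vectors standing in for real embeddings (assumed values, for illustration only).
toy_vectors = {
    "entity1": [1.0, 0.0, 0.0],
    "entity2": [0.9, 0.1, 0.0],
    "entity3": [0.0, 1.0, 0.0],
}

print(coarse_candidates(toy_vectors))
```

Lowering the threshold lets more pairs through, which is the recall versus precision trade-off described in the comments of similarity_candidates.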

Next is the fine-grained filter of the disambiguation stage. It sends the prompt to the LLM and parses the answer, again with the same JSON compatibility handling:


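def similarity_llm_single(self, entity1, entity2):
    # Task: take the details of two entities and ask the LLM whether they are the same entity.
    # entity1, entity2: the two entity dicts to compare, for example:
    # entity1 = {"name": "苹果公司", "type": "Company", "description": "科技巨头"}
    # entity2 = {"name": "Apple Inc.", "type": "Company", "description": "苹果公司的英文名"}
    # Step 1: load the prompt (see judge_sim_entity_en in prompt.py).
    prompt = ChatPromptTemplate.from_template(judge_sim_entity_en)
    # Step 2: build the chain.
    # Note that it uses self.similarity_model rather than self.model.
    # This may be a model tuned for similarity judgments, or the same model with a different
    # configuration, for example a lower temperature for more consistent answers.
    chain = prompt | self.similarity_model
    # Step 3: send the request and wait for the result.
    # str(entity1) and str(entity2) turn the entity dicts into strings, so whatever
    # structure they contain is passed to the LLM in full.
    # For example, str({"name": "苹果"}) gives "{'name': '苹果'}".
    result = chain.invoke({"entity1": str(entity1), "entity2": str(entity2)})
    # Step 4: parse the verdict, with the same compatibility handling as before.
    # json.loads needs strict JSON; the sample output in the prompt, {'result': True},
    # is Python-style, so this relies on the model actually returning valid JSON.
    if hasattr(result, 'content'):
        result_json = json.loads(result.content)
    else:
        result_json = json.loads(result)
    return result_json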
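
Since the prompt shows its expected answer as {'result': True}, which is Python syntax rather than strict JSON, a more forgiving parser can be useful when a model echoes that format or wraps its answer in extra text. The following is one possible sketch, not the project's implementation: parse_llm_dict is a hypothetical helper that tries json.loads first and falls back to ast.literal_eval.

```python
import ast
import json
import re

def parse_llm_dict(raw):
    # Hypothetical helper: extract the outermost {...} span from the model output.
    match = re.search(r"\{.*\}", raw.strip(), re.DOTALL)
    if match is None:
        raise ValueError("no dict-like object found in model output")
    candidate = match.group(0)
    try:
        # Strict JSON: double quotes, lowercase true/false/null.
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Python literal syntax: single quotes, True/False/None.
        value = ast.literal_eval(candidate)
        if not isinstance(value, dict):
            raise ValueError("model output is not a dict")
        return value

print(parse_llm_dict('{"result": true}'))
print(parse_llm_dict("{'result': False}"))
print(parse_llm_dict("Answer: {'result': True}"))
```

ast.literal_eval only evaluates literals, so it does not execute arbitrary code from the model output; input it cannot parse still raises an exception, which the caller's try/except can handle.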

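The next function is the main routine for handling similar entities. Of the five steps listed earlier (assign chunk IDs, coarse filter, LLM fine filter, collect the confirmed pairs, merge), it covers the middle three: the coarse filter, the LLM fine filter, and collecting the confirmed pairs.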


def similartiy_result(self, entities):
    # Task: run the complete similarity check over a batch of entities.
    # entities: a dict of all entities to process, for example:
    # {"entity1": {...}, "entity2": {...}, "entity3": {...}, ...}
    # Step 1: Use similarity_candidates for initial filtering
    # --- Phase 1: coarse filter (fast but imprecise) ---
    # similarity_candidates uses vector similarity to find every pair that looks alike.
    # candidates is a list such as [('entity1', 'entity3'), ('entity2', 'entity5')],
    # meaning entity1 resembles entity3 and entity2 resembles entity5.
    # It is called with the default threshold (0.60).
    candidates = self.similarity_candidates(entities)

    # Step 2: Fine-grained LLM judgment for each candidate pair
    # --- Phase 2: fine judgment (slower but more precise) ---
    candidates_result = []
    # ent_pair is a tuple such as ('entity1', 'entity3')
    for ent_pair in candidates:
        # Extract entity objects from entities dictionary
        # ent_pair[0] is the key of the first entity, e.g. 'entity1';
        # entities.get('entity1') returns its full dict, {"name": "苹果公司", ...}
        entity1 = entities.get(ent_pair[0])
        entity2 = entities.get(ent_pair[1])

        # Call LLM for judgment
        # The call is wrapped in try-except because network requests and API calls can fail.
        try:
            # similarity_llm_single returns a verdict such as {"result": true}
            result = self.similarity_llm_single(entity1, entity2)
            # Keep if LLM judges as same entity
            # result.get('result', False) returns the value of 'result' if the key exists and
            # falls back to False (not the same) if the LLM answer has an unexpected shape.
            if result.get('result', False):
                # The LLM confirmed the two entities are the same: record the pair.
                candidates_result.append(ent_pair)
        except Exception as e:
            # On any error (network failure, JSON parsing error, ...), report it
            # and move on to the next pair instead of aborting the whole run.
            print(f"Error processing entity pair {ent_pair}: {str(e)}")
            continue  # Can log or raise exception as needed

    # Step 3: Return final filtered candidate pairs
    return candidates_result

Last in the disambiguation stage is the merge itself, which is built on the union-find (disjoint-set) algorithm:


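def entity_Disambiguation(self, entity_dic, sim_entity_list):
    # entity_dic: dict of all entities, e.g. {"entity1": {...}, "entity2": {...}}
    # sim_entity_list: list of pairs confirmed as identical, e.g. [("entity1", "entity2"), ("entity2", "entity3")]
    # Goal: merge identical entities and return the cleaned-up dict.

    # parent is the core data structure of union-find:
    # it records the parent (ultimately the representative) of every entity.
    parent = {}

    # find: return the root (representative) of the group x belongs to
    def find(x):
        # If x's parent is not x itself, x is not a root
        if parent[x] != x:
            # Recurse up to the real root and apply path compression:
            # point x straight at the root to speed up later lookups
            parent[x] = find(parent[x])
        return parent[x]

    # union: merge the groups that x and y belong to
    def union(x, y):
        # Find the root of each
        root_x = find(x)
        root_y = find(y)
        # Different roots mean different groups
        if root_x != root_y:
            # Attach y's root under x's root (the two groups become one)
            parent[root_y] = root_x

    # Initialize union-find
    # At the start every entity is its own group and its own root
    for entity in entity_dic:
        parent[entity] = entity

    # Apply a union for every confirmed pair
    for pair in sim_entity_list:
        a, b = pair
        # Safety check: both entities must exist
        if a in entity_dic and b in entity_dic:
            union(a, b)

    # Step 2: Merge similar entities
    # groups: key is the root, value is the list of all members of that group
    groups = {}

    for entity in entity_dic:
        # Find the root of this entity
        root = find(entity)
        # Create the member list for this root on first sight
        if root not in groups:
            groups[root] = []
        # Add the entity to its group
        groups[root].append(entity)

    # Step 3: Process each merge group
    for group in groups.values():
        # A group with a single member needs no merging
        if len(group) == 1:
            continue

        # Keep the first entity as the main entity.
        # This follows insertion order; other strategies (e.g. shortest name) are possible.
        main_entity = group[0]

        # Sets collect all descriptions and sources and remove duplicates.
        # A set has no guaranteed order, so the order of the joined parts may vary.
        descriptions = set()
        chunkids = set()

        # Walk through every entity in the group
        for e in group:
            # Collect its description
            descriptions.add(entity_dic[e]['description'])
            # Collect its source chunk
            chunkids.add(entity_dic[e]['chunkid'])

            # Delete every non-main entity (its information is absorbed by the main one)
            if e != main_entity:
                del entity_dic[e]

        # Write the collected information into the main entity,
        # joined with ';;;' so it can be split again later
        entity_dic[main_entity]['description'] = ';;;'.join(descriptions)
        entity_dic[main_entity]['chunkid'] = ';;;'.join(chunkids)

    # Step 4: Directly return merged entity dictionary
    return entity_dic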

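Why this design? An example:

# Initial state: everyone is their own root
parent = {"A": "A", "B": "B", "C": "C", "D": "D"}

# Run union("A", "B")
# The root of A is A, the root of B is B
# After the merge, B points to A
parent = {"A": "A", "B": "A", "C": "C", "D": "D"}

# Run union("B", "C")
# The root of B is A (found through find), the root of C is C
# After the merge, C points to A
parent = {"A": "A", "B": "A", "C": "A", "D": "D"}

# A, B, and C are now in the same group, with A as the root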
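
The trace above can be run as is. The sketch below repeats the find and union logic from entity_Disambiguation in a self-contained form; make_union_find is a wrapper name introduced here for the demo only.

```python
def make_union_find(items):
    # Every item starts as its own root.
    parent = {item: item for item in items}

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])   # path compression
        return parent[x]

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x       # attach y's root under x's root

    return parent, find, union

parent, find, union = make_union_find(["A", "B", "C", "D"])
union("A", "B")
union("B", "C")

# Collect the members of each group under their root.
groups = {}
for item in parent:
    groups.setdefault(find(item), []).append(item)

print(parent)
print(groups)
```

Transitivity comes for free: A was never paired with C directly, yet both end up in the same group because each was paired with B.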

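Step 1: initialize and run the unions:

# Initialization: every entity starts on its own, as its own root
for entity in entity_dic:
    parent[entity] = entity

# Apply a union for every confirmed pair
for pair in sim_entity_list:
    a, b = pair
    # Safety check: both entities must exist
    if a in entity_dic and b in entity_dic:
        union(a, b)  # put the two entities into the same group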

Step 2: collect the merge groups:

# groups: key is the root, value is the list of all members of that group
groups = {}

for entity in entity_dic:
    # Find the root of this entity
    root = find(entity)
    # Create the member list for this root on first sight
    if root not in groups:
        groups[root] = []
    # Add the entity to its group
    groups[root].append(entity)

Result after this step:

# Suppose entity1, entity2, and entity3 are different names for the same entity
groups = {
    "entity1": ["entity1", "entity2", "entity3"],  # one group
    "entity4": ["entity4"],  # a single entity on its own
    "entity5": ["entity5", "entity6"]  # another group
}

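Step 3: process each merge group:

for group in groups.values():
    # A group with a single member needs no merging
    if len(group) == 1:
        continue

    # Keep the first entity as the main entity.
    # This follows insertion order; other strategies (e.g. shortest name) are possible.
    main_entity = group[0]

    # Sets collect all descriptions and sources and remove duplicates
    descriptions = set()
    chunkids = set()

    # Walk through every entity in the group
    for e in group:
        # Collect its description
        descriptions.add(entity_dic[e]['description'])
        # Collect its source chunk
        chunkids.add(entity_dic[e]['chunkid'])

        # Delete every non-main entity (its information is absorbed by the main one)
        if e != main_entity:
            del entity_dic[e]

    # Write the collected information into the main entity,
    # joined with ';;;' so it can be split again later
    entity_dic[main_entity]['description'] = ';;;'.join(descriptions)
    entity_dic[main_entity]['chunkid'] = ';;;'.join(chunkids)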

Result:

# Before the merge:
entity_dic = {
    "entity1": {"name": "苹果公司", "type": "Company", 
                "description": "科技巨头", "chunkid": "chunk1"},
    "entity2": {"name": "Apple Inc.", "type": "Company", 
                "description": "iPhone制造商", "chunkid": "chunk2"},
    "entity3": {"name": "苹果", "type": "公司", 
                "description": "库比蒂诺的公司", "chunkid": "chunk3"}
}

# After the merge (the parts are collected in sets, so their order inside the joined strings may vary):
entity_dic = {
    "entity1": {"name": "苹果公司", "type": "Company", 
                "description": "科技巨头;;;iPhone制造商;;;库比蒂诺的公司",
                "chunkid": "chunk1;;;chunk2;;;chunk3"}
}
# entity2 and entity3 have been deleted; their information now lives in entity1
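
If a reproducible order of the joined parts matters (for example when comparing outputs between runs), the merge of one group can collect values in lists instead of sets. The following is a sketch of such a variant under that assumption, not the project's code: merge_group is a hypothetical function, and it converts chunk IDs with str() so that numeric IDs do not break the join.

```python
def merge_group(entity_dic, group, sep=";;;"):
    # Hypothetical order-preserving variant of the per-group merge step.
    main_entity = group[0]
    descriptions, chunkids = [], []
    for key in group:
        entity = entity_dic[key]
        # Keep first-seen order while skipping duplicates.
        if entity["description"] not in descriptions:
            descriptions.append(entity["description"])
        chunk = str(entity["chunkid"])
        if chunk not in chunkids:
            chunkids.append(chunk)
        if key != main_entity:
            del entity_dic[key]
    entity_dic[main_entity]["description"] = sep.join(descriptions)
    entity_dic[main_entity]["chunkid"] = sep.join(chunkids)
    return entity_dic

demo = {
    "entity1": {"name": "苹果公司", "type": "Company", "description": "科技巨头", "chunkid": "chunk1"},
    "entity2": {"name": "Apple Inc.", "type": "Company", "description": "iPhone制造商", "chunkid": "chunk2"},
    "entity3": {"name": "苹果", "type": "公司", "description": "库比蒂诺的公司", "chunkid": "chunk3"},
}
merged = merge_group(demo, ["entity1", "entity2", "entity3"])
print(merged["entity1"]["description"])
print(merged["entity1"]["chunkid"])
```

The list-based duplicate check is linear per item, which is fine for the handful of entities in a typical merge group.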

The test script below runs the whole disambiguation stage, which has five parts: load the input, assign chunk IDs, coarse-filter candidates, confirm entities with the LLM, and merge:


# test_kg_agent_stage2_real_v2.py

from src.kgAgent import NER_Agent
import json

# --- Step A: prepare the raw output of stage 1 ---
# This data has no chunkid yet; it mirrors the JSON content returned by the LLM.
real_ner_output_without_id = {
    "entity1": {"name": "苹果公司", "type": "Company", "description": "一家总部位于库比蒂诺的科技巨头"},
    "entity2": {"name": "Apple Inc.", "type": "Company", "description": "苹果公司的英文名称"},
    "entity3": {"name": "史蒂夫·乔布斯", "type": "Person", "description": "苹果公司创始人"},
    "entity4": {"name": "库比蒂诺", "type": "Location", "description": "苹果公司的总部所在地"},
    "entity5": {"name": "Vision Pro", "type": "Product", "description": "苹果公司发布的一款混合现实头戴设备"},
    "entity6": {"name": "蒂姆·库克", "type": "Person", "description": "苹果公司的现任首席执行官"}
}

# Simulated ID of the chunk this text came from
sample_chunk_id = "doc1_chunk1"

print("--- 第二阶段：实体消歧测试 (v2 - 调用 add_chunkid) ---")

print("\n正在初始化 NER_Agent...")
agent = NER_Agent()
print("Agent 初始化完成。")

# --- Step B: call add_chunkid ---
# Instead of adding chunkid by hand, we call the agent's own method.
# This mirrors the real flow: get the entities first, then tag them with their source.
print(f"\n正在为实体数据动态添加 chunk_id: '{sample_chunk_id}'...")
real_entities_with_id = agent.add_chunkid(real_ner_output_without_id, sample_chunk_id)

print("\n添加 chunk_id 后的待处理实体:")
print(json.dumps(real_entities_with_id, indent=2, ensure_ascii=False))

try:
    # --- Step 2a: coarse filter with vector similarity ---
    print("\n--- 正在运行步骤 2a: 使用 Embedding 寻找相似性候选者... ---")
    candidates = agent.similarity_candidates(real_entities_with_id, threshold=0.9)   # explicit threshold for this call only (the default is 0.60)
    print(f"找到了 {len(candidates)} 对可能需要合并的候选者: {candidates}")

    # --- Step 2b: fine judgment with the LLM ---
    # similartiy_result runs similarity_candidates again internally with the default
    # threshold (0.60), so the pairs it sends to the LLM can differ from those printed in step 2a.
    print("\n--- 正在运行步骤 2b: 使用 LLM 确认候选者... ---")
    confirmed_pairs = agent.similartiy_result(real_entities_with_id)
    print(f"LLM 确认了 {len(confirmed_pairs)} 对是相同的: {confirmed_pairs}")

    # --- Step 2c: final merge ---
    print("\n--- 正在运行步骤 2c: 合并已确认的实体... ---")
    # Pass in the version that carries chunk_id, because the merge step needs it
    merged_entities = agent.entity_Disambiguation(real_entities_with_id, confirmed_pairs)

    print("\n--- 消歧完成! ---")
    print("最终清理干净的实体:")
    print(json.dumps(merged_entities, indent=2, ensure_ascii=False))

except Exception as e:
    print(f"\n--- 发生错误 ---")
    print(f"错误详情: {e}")
    print("请检查你的 API Key、base_url 以及模型是否可用。")

Output:


--- 第二阶段：实体消歧测试 (v2 - 调用 add_chunkid) ---

正在初始化 NER_Agent...
Agent 初始化完成。

正在为实体数据动态添加 chunk_id: 'doc1_chunk1'...

添加 chunk_id 后的待处理实体:
{
  "entity1": {
    "name": "苹果公司",
    "type": "Company",
    "description": "一家总部位于库比蒂诺的科技巨头",
    "chunkid": "doc1_chunk1"
  },
  "entity2": {
    "name": "Apple Inc.",
    "type": "Company",
    "description": "苹果公司的英文名称",
    "chunkid": "doc1_chunk1"
  },
  "entity3": {
    "name": "史蒂夫·乔布斯",
    "type": "Person",
    "description": "苹果公司创始人",
    "chunkid": "doc1_chunk1"
  },
  "entity4": {
    "name": "库比蒂诺",
    "type": "Location",
    "description": "苹果公司的总部所在地",
    "chunkid": "doc1_chunk1"
  },
  "entity5": {
    "name": "Vision Pro",
    "type": "Product",
    "description": "苹果公司发布的一款混合现实头戴设备",
    "chunkid": "doc1_chunk1"
  },
  "entity6": {
    "name": "蒂姆·库克",
    "type": "Person",
    "description": "苹果公司的现任首席执行官",
    "chunkid": "doc1_chunk1"
  }
}

--- 正在运行步骤 2a: 使用 Embedding 寻找相似性候选者... ---
找到了 2 对可能需要合并的候选者: [('entity1', 'entity2'), ('entity3', 'entity6')]

--- 正在运行步骤 2b: 使用 LLM 确认候选者... ---
LLM 确认了 1 对是相同的: [('entity1', 'entity2')]

--- 正在运行步骤 2c: 合并已确认的实体... ---

--- 消歧完成! ---
最终清理干净的实体:
{
  "entity1": {
    "name": "苹果公司",
    "type": "Company",
    "description": "一家总部位于库比蒂诺的科技巨头;;;苹果公司的英文名称",
    "chunkid": "doc1_chunk1"
  },
  "entity3": {
    "name": "史蒂夫·乔布斯",
    "type": "Person",
    "description": "苹果公司创始人",
    "chunkid": "doc1_chunk1"
  },
  "entity4": {
    "name": "库比蒂诺",
    "type": "Location",
    "description": "苹果公司的总部所在地",
    "chunkid": "doc1_chunk1"
  },
  "entity5": {
    "name": "Vision Pro",
    "type": "Product",
    "description": "苹果公司发布的一款混合现实头戴设备",
    "chunkid": "doc1_chunk1"
  },
  "entity6": {
    "name": "蒂姆·库克",
    "type": "Person",
    "description": "苹果公司的现任首席执行官",
    "chunkid": "doc1_chunk1"
  }
}
