LangExtract + 知识图谱 — Google 用于 NLP 任务的新库

原文摘要：LangExtract 是 Google 于 2025-07-30 开源的"程序化抽取"工具，面向邮件、报告、病历等非结构化文本，精确抽取所需信息并将每个抽取结果与原文字符偏移（offset）绑定，实现可追溯、可高亮验证的结构化输出。其核心能力包括：对长文档的分块与并行处理、多轮抽取确保召回、直接生成结构化结果以减少传统 RAG 中的切分与向量嵌入开销；可同时兼容云端大模型（如 Gemini）与本地开源模型，并支持自定义提示模板以适配不同领域。本文提供一个以 Streamlit 为界面、Agraph 为可视化、LangExtract 为抽取核心的"知识图谱 + 聊天机器人"最小示范：动态 few-shot 依据关键词自动选择模板，先并行抽取实体与关系，再构建节点与边；若未检测到显式关系，回退以"related_to"边保持图连通；最后支持查询过滤并在多页签展示图、实体、关系与查询结果。文章提醒：单一工具不可包打天下，需组合使用并通过反馈迭代提升质量。

1 主题导入

在这个快速演示中，我将展示如何用 LangExtract 构建一个知识图谱，并将其与聊天机器人结合，为企业或个人打造实用的问答系统。

在数据驱动的今天，许多有价值的信息潜藏在非结构化文本中（如临床记录、冗长的法律合同、用户反馈话题）。从这些文档中抽取"有意义且可追溯"的信息一直是技术与实践上的双重挑战。

2025 年 7 月 30 日，Google 发布开源 AI 项目 LangExtract。该工具能从我们每天会读到的文本（邮件、报告、病历等）中"只抽取必要信息"，并组织成计算机易处理的格式。

尽管 AI 很有用，但也有短板：可能产生幻觉、提供错误信息、一次可保留的上下文有限、同样问题可能每次回答不同。LangExtract 充当一座"智能桥梁"，弥补上述弱点，把"理解文本"的能力转化为"抽取可靠信息"的能力。

接下来给出一个在线聊天机器人的快速演示来说明这一点。

示例问题："Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. The company is headquartered in Cupertino, California. Steve Jobs served as CEO until he died in 2011."

在该系统的输出生成过程中，智能体会用 document_extractor_tool 抽取实体：它调用 LangExtract，并基于动态 few-shot 示例按查询关键词自动选择合适的抽取模板------例如检测到 "financial""revenue""company" 等关键词时将应用面向业务的示例，将公司名、人物、地点与日期等正确归类而非落入通用类别。实体抽取与关系抽取并行运行：系统通过文档上下文识别"创立于""总部位于""与...竞争"等关系。两者完成后，build_graph_data 会创建图结构：为每个唯一实体创建节点、为每个发现的关系创建边；若未检测到显式关系，则通过"related_to"作为稳健回退确保连通。

最终用 Streamlit Agraph 渲染交互式知识图谱，用户可探索公司、创始人、地点等之间的连接，系统在内存中运行、无需文件操作，并提供实时调试信息（实体与关系数量），支持针对科技公司与其相互关系的查询与结果过滤。

2 什么是 LangExtract？

LangExtract 是 Google 最新开源、公开可用的抽取特性，有望为开发者与数据团队"重建理性秩序"。它不只是"用 AI 抽信息"，而是将每次抽取与原文绑定。LangExtract 作为构建在 LLM 之上的"特殊机制"，针对信息抽取中的常见挑战（幻觉、不精确、有限上下文窗口、非确定性）最大化 LLM 的可用性。

2.1 LangExtract 有何特别之处？

LangExtract 的核心优势是 "程序化抽取"：不仅精准识别所需信息，还能把每条抽取结果链接到原文的确切字符位置（offset）。这种可追溯性允许对结果进行高亮与验证，显著提升交互式的数据可靠性。

LangExtract 具备一系列强大功能：可通过分块、并行计算与多轮抽取高效处理"百万 token 级"的长文档以保证召回；直接生成结构化输出，从而无需传统 RAG 工作流里的切分与嵌入；同时兼容云端模型（如 Gemini）与本地开源大模型，并支持自定义提示模板，轻松适配不同领域。

2.2 开始编码

我们逐步探索如何用 LangExtract 构建知识图谱式的聊天机器人。首先安装所需库：

复制代码

# 功能：安装依赖
# 说明：按原文示例，使用 requirements 列表安装依赖（原文命令含逗号拼写）
pip install -r requirements,txt

下一步是常规导入相关库。langextract 是一个 Python 库，基于用户定义的指令，利用 LLM 从非结构化文本中抽取结构化信息；streamlit_agraph 是 Streamlit 的自定义组件，用于交互式图可视化。

复制代码

# 功能：导入依赖库
import os
import textwrap
import langextract as lx
import logging
import streamlit as st
from streamlit_agraph import Config, Edge, Node, agraph
from typing import List, Dict, Any, Optional
import json

现在创建 document_extractor_tool。该函数接收两段字符串：unstructured_text 与 user_query，返回一个 Python 字典，便于后续转为 JSON。内部先用 textwrap.dedent(...) 构造干净的提示词，明确模型的角色（抽取专家）、任务（抽取相关信息）与需关注的具体查询。随后准备 few-shot 示例来引导抽取：检测关键词后（财务、法律、社交/餐饮等），自动挑选对应示例以确保模型理解输出结构与抽取方式；若不匹配则使用通用示例。最后调用 lx.extract(...)，传入文本、提示、示例与保存在环境变量中的 API Key；记录日志便于调试，并规范化输出为包含 text、class 与 attributes 的字典列表。

复制代码

# 功能：基于用户查询的结构化信息抽取
# 说明：动态 few-shot 模板选择 + LangExtract 抽取 + 统一字典化输出
def document_extractor_tool(unstructured_text: str, user_query: str) -> dict:
    """
    从给定非结构化文本中，依据用户查询抽取结构化信息。
    返回包含抽取结果字典列表的对象，便于 JSON 化与下游处理。
    """
    prompt = textwrap.dedent(f"""
    You are an expert at extracting specific information from documents.
    Based on the user's query, extract the relevant information from the provided text.
    The user's query is: "{user_query}"
    Provide the output in a structured JSON format.
    """)

    
    examples = []
    query_lower = user_query.lower()
    if any(keyword in query_lower for keyword in ["financial", "revenue", "company", "fiscal"]):
        financial_example = lx.data.ExampleData(
            text="In Q1 2023, Innovate Inc. reported a revenue of $15 million.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="company_name",
                    extraction_text="Innovate Inc.",
                    attributes={"name": "Innovate Inc."},
                ),
                lx.data.Extraction(
                    extraction_class="revenue",
                    extraction_text="$15 million",
                    attributes={"value": 15000000, "currency": "USD"},
                ),
                lx.data.Extraction(
                    extraction_class="fiscal_period",
                    extraction_text="Q1 2023",
                    attributes={"period": "Q1 2023"},
                ),
            ]
        )
        examples.append(financial_example)
    elif any(keyword in query_lower for keyword in ["legal", "agreement", "parties", "effective date"]):
        legal_example = lx.data.ExampleData(
            text="This agreement is between John Doe and Jane Smith, effective 2024-01-01.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="party",
                    extraction_text="John Doe",
                    attributes={"name": "John Doe"},
                ),
                lx.data.Extraction(
                    extraction_class="party",
                    extraction_text="Jane Smith",
                    attributes={"name": "Jane Smith"},
                ),
                lx.data.Extraction(
                    extraction_class="effective_date",
                    extraction_text="2024-01-01",
                    attributes={"date": "2024-01-01"},
                ),
            ]
        )
        examples.append(legal_example)
    elif any(keyword in query_lower for keyword in ["social", "post", "feedback", "restaurant", "菜式", "評價"]):
        social_media_example = lx.data.ExampleData(
            text="I tried the new 'Taste Lover' restaurant in TST today. The black truffle risotto was amazing, but the Tiramisu was just average.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="restaurant_name",
                    extraction_text="Taste Lover",
                    attributes={"name": "Taste Lover"},
                ),
                lx.data.Extraction(
                    extraction_class="dish",
                    extraction_text="black truffle risotto",
                    attributes={"name": "black truffle risotto", "sentiment": "positive"},
                ),
                lx.data.Extraction(
                    extraction_class="dish",
                    extraction_text="Tiramisu",
                    attributes={"name": "Tiramisu", "sentiment": "neutral"},
                ),
            ]
        )
        examples.append(social_media_example)
    else:
        
        generic_example = lx.data.ExampleData(
            text="Juliet looked at Romeo with a sense of longing.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="character", extraction_text="Juliet", attributes={"name": "Juliet"}
                ),
                lx.data.Extraction(
                    extraction_class="character", extraction_text="Romeo", attributes={"name": "Romeo"}
                ),
                lx.data.Extraction(
                    extraction_class="emotion", extraction_text="longing", attributes={"type": "longing"}
                ),
            ]
        )
        examples.append(generic_example)

    logging.info(f"Selected {len(examples)} few-shot example(s).")

    result = lx.extract(
        text_or_documents=unstructured_text,
        prompt_description=prompt,
        examples=examples,
        api_key=os.getenv("GOOGLE_API_KEY")
    )
    
    logging.info(f"Extraction result: {result}")

    
    extractions = [
        {"text": e.extraction_text, "class": e.extraction_class, "attributes": e.attributes}
        for e in result.extractions
    ]
    
    return {
        "extracted_data": extractions
    }

接下来是 load_gemini_key()，它返回一个二元组：API Key（str）与"是否已提供 Key"的标记（bool）。它先尝试从 .streamlit/secrets.toml 中读取 GOOGLE_API_KEY；若存在，则在侧边栏提示成功并置标记为 True；否则在侧边栏以"密码输入框"方式请求用户输入；若用户输入则提示成功并置标记为 True，否则提示错误。

原文代码片段在此处截断，保留其结构意图：

复制代码

# 功能：加载 Gemini API Key（示意片段，原文代码截断）
# 说明：优先读取 .streamlit/secrets.toml，其次侧边栏交互输入，设置标记
# Streamlit utility functions
def load_gemini_key() -> tuple[str, bool]:
    """Load the Gemini API key from the environment variable or user input."""
    key = ""
    is_key_provided = False
    secrets_file = os.path.join(".streamlit", "secrets.toml")
    if os.path.exists(secrets_file) and "GOOGLE_API_KEY" in st.secrets.keys():
        key = st.secrets["GOOGLE_API_KEY"]
        st.sidebar.success(
        is_key_provided = True
    else:
        key = st.sidebar.text_input(
            
        if len(key) > 0:

随后定义两个可视化相关函数：format_output_agraph(output) 将原始图数据转换为 Agraph 的 Node 与 Edge 列表；
display_agraph(nodes, edges) 设置图的宽高、布局、物理引擎、层级结构等，并渲染在 Streamlit 中。

复制代码

# 功能：Agraph 图数据格式化与展示
def format_output_agraph(output):
    nodes = []
    edges = []
    for node in output["nodes"]:
        nodes.append(
            Node(id=node["id"], label=node["label"], size=8, shape="diamond"))
    for edge in output["edges"]:
        edges.append(Edge(source=edge["source"], label=edge["relation"],
                     target=edge["target"], color="#4CAF50", arrows="to"))
    return nodes, edges

def display_agraph(nodes, edges):
    config = Config(width=950,
                    height=950,
                    directed=True,
                    physics=True,
                    hierarchical=True,
                    nodeHighlightBehavior=False,
                    highlightColor="#F7A7A6",
                    collapsible=False,
                    node={'labelProperty': 'label'},
                    )
    return agraph(nodes=nodes, edges=edges, config=config)

然后是两组核心抽取函数：extract_entities(documents) 针对每个文档调用 document_extractor_tool 抽取公司名、收入与会计期间等财务实体，并汇总；extract_relationships(documents) 针对每个文档抽取公司与期间之间的收入等关系并汇总。

复制代码

# 功能：GraphRAG 抽取实体与关系（财务示例）
def extract_entities(documents: List[str]) -> List[Dict[str, Any]]:
    """Extract entities from documents"""
    all_entities = []
    
    for doc in documents:
        result = document_extractor_tool(
            doc, 
            "Extract financial entities including company names, revenue figures, and fiscal periods from business documents"
        )
        all_entities.extend(result["extracted_data"])
    
    return all_entities

def extract_relationships(documents: List[str]) -> List[Dict[str, Any]]:
    """Extract relationships between entities"""
    all_relationships = []
    
    for doc in documents:
        result = document_extractor_tool(
            doc,
            "Extract financial relationships and revenue connections between companies and fiscal periods"
        )
        all_relationships.extend(result["extracted_data"])
    
    return all_relationships

接着构建图数据：build_graph_data(entities, relationships) 将实体转换为节点并建立"文本到节点 ID"的映射；遍历关系文本以匹配其中出现的实体，并在其间创建边；若未成功识别显式关系，则以"所有实体之间的共现关系"为回退策略保持连通。answer_query(entities, relationships, query) 则根据查询词对实体与关系进行匹配过滤，返回相关项与计数。

复制代码

# 功能：图数据构建与查询回答
def build_graph_data(entities: List[Dict[str, Any]], relationships: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build graph data for visualization"""
    nodes = []
    edges = []
    
    # Create nodes from entities
    entity_map = {}
    for i, entity in enumerate(entities):
        node_id = str(i)
        nodes.append({
            "id": node_id,
            "label": entity["text"],
            "type": entity["class"]
        })
        entity_map[entity["text"].lower()] = node_id
    
    # Create edges from relationships and simple co-occurrence
    for rel in relationships:
        rel_text = rel["text"].lower()
        found_entities = []
        
        # Find entities mentioned in this relationship
        for entity_text, entity_id in entity_map.items():
            if entity_text in rel_text:
                found_entities.append(entity_id)
        
        # Create edges between found entities
        for i in range(len(found_entities)):
            for j in range(i + 1, len(found_entities)):
                edges.append({
                    "source": found_entities[i],
                    "target": found_entities[j],
                    "relation": rel["class"]
                })
    
    # If no relationships found, create simple co-occurrence edges
    if not edges:
        st.write("No relationship edges found, creating fallback edges...")
        for i, entity1 in enumerate(entities):
            for j, entity2 in enumerate(entities):
                if i < j:
                    # Create edges between all entities
                    edges.append({
                        "source": str(i),
                        "target": str(j),
                        "relation": "related_to"
                    })
    
    return {"nodes": nodes, "edges": edges}

def answer_query(entities: List[Dict[str, Any]], relationships: List[Dict[str, Any]], query: str) -> Dict[str, Any]:
    """Answer query using extracted entities and relationships"""
    if not query:
        return None
    
    # Find relevant entities
    relevant_entities = [
        e for e in entities 
        if any(word.lower() in e["text"].lower() or word.lower() in str(e["attributes"]).lower() 
               for word in query.split())
    ]
    
    # Find relevant relationships
    relevant_relationships = [
        r for r in relationships
        if any(word.lower() in r["text"].lower() or word.lower() in str(r["attributes"]).lower()
               for word in query.split())
    ]
    
    return {
        "query": query,
        "relevant_entities": relevant_entities,
        "relevant_relationships": relevant_relationships,
        "entity_count": len(relevant_entities),
        "relationship_count": len(relevant_relationships)
    }

然后是主处理管线：process_documents(documents, query=None) 先调用前述两函数抽取所有实体与关系，并打印调试信息；随后传入 build_graph_data 构建节点与边，再次打印图大小；若提供查询则调用 answer_query 返回相关实体与关系。该函数返回一个包含实体、关系、图数据与查询结果的综合字典，便于可视化或进一步分析。

复制代码

# 功能：处理文档并可选回答查询
def process_documents(documents: List[str], query: str = None) -> Dict[str, Any]:
    """Process documents and optionally answer a query"""
    
    entities = extract_entities(documents)
    relationships = extract_relationships(documents)
    
    
    st.write(f"Debug: Found {len(entities)} entities, {len(relationships)} relationships")
    
    
    graph_data = build_graph_data(entities, relationships)
    
    
    st.write(f"Debug: Graph has {len(graph_data['nodes'])} nodes, {len(graph_data['edges'])} edges")
    
    
    results = answer_query(entities, relationships, query) if query else None
    
    return {
        "entities": entities,
        "relationships": relationships,
        "graph_data": graph_data,
        "results": results
    }

最后是应用入口：设置页面标题与布局，加载 Gemini API Key（若未提供则警告并中止），将 Key 写入环境，定义预置的科技公司文本列表并提示数量，提供可选查询框，点击按钮后执行 process_documents。结果以四个页签呈现：Graph Visualization（图可视化） 、Entities（实体） 、Relationships（关系） 、Query Results（查询结果） ；其中图页签调用 format_output_agraph 与 display_agraph 渲染交互式知识图谱，其他页签展示可展开的 JSON 明细。

复制代码

# 功能：Streamlit 应用入口，组织 UI 与交互
def main():
    st.set_page_config(page_title="GraphRAG with LangExtract", layout="wide")
    st.title("GraphRAG with LangExtract")
    
    
    api_key, is_key_provided = load_gemini_key()
    
    if not is_key_provided:
        st.warning("Please provide an API key to continue")
        return
    
    
    os.environ["GOOGLE_API_KEY"] = api_key
    
    
    documents = [
        "Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. The company is headquartered in Cupertino, California. Steve Jobs served as CEO until his death in 2011.",
        "Microsoft Corporation was founded by Bill Gates and Paul Allen in 1975. It's based in Redmond, Washington. Bill Gates was the CEO for many years.",
        "Both Apple and Microsoft are major technology companies that compete in various markets including operating systems and productivity software. They have a long history of rivalry.",
        "Google was founded by Larry Page and Sergey Brin in 1998. The company started as a search engine but has expanded into many areas including cloud computing and artificial intelligence."
    ]
    
    st.success(f"Using {len(documents)} predefined documents about tech companies")
        
    
    query = st.text_input("Enter your query (optional):")
    
    if st.button("Process Documents"):
        with st.spinner("Processing documents..."):
            result = process_documents(documents, query if query else None)
            
            
            tab1, tab2, tab3, tab4 = st.tabs(["Graph Visualization", "Entities", "Relationships", "Query Results"])
            
            with tab1:
                if result["graph_data"]:
                    st.subheader("Knowledge Graph")
                    
                    nodes, edges = format_output_agraph(result["graph_data"])
                    if nodes:
                        display_agraph(nodes, edges)
                    else:
                        st.info("No graph data to display")
            
            with tab2:
                st.subheader("Extracted Entities")
                if result["entities"]:
                    for i, entity in enumerate(result["entities"]):
                        with st.expander(f"{entity['text']} ({entity['class']})"):
                            st.json(entity["attributes"])
                else:
                    st.info("No entities extracted")
            
            with tab3:
                st.subheader("Extracted Relationships")
                if result["relationships"]:
                    for i, rel in enumerate(result["relationships"]):
                        with st.expander(f"{rel['text']} ({rel['class']})"):
                            st.json(rel["attributes"])
                else:
                    st.info("No relationships extracted")
            
            with tab4:
                if query and result["results"]:
                    st.subheader("Query Results")
                    st.json(result["results"])
                else:
                    st.info("No query provided or no results")

if __name__ == "__main__":
    main()

3 结论

LangExtract 并不能单枪匹马解决所有问题，但新的 AI 工具需要被持续开发与发布。将多种工具组合使用，能更好地暴露各自的不足，推动进一步改进。近年来 AI 取得显著进步，这背后离不开大量用户的反馈与实践。使用 AI 并不存在绝对的"失败"，先试起来、并亲自构建适合自己的 AI，是一个不错的思路。