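Unleashing your metadata: Self-querying retrievers with Elasticsearch

By Josh Asres, Elastic

Learn how to use Elasticsearch "self-querying" retrievers to improve the relevance of semantic search with structured filters.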

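In the world of AI-powered search, efficiently finding the right data in massive datasets is paramount. Traditional keyword-based search often falls short when queries are phrased in natural language, which is where semantic search comes in. However, if you want to combine the power of semantic search with the ability to filter on structured metadata such as dates and numeric values, self-querying retrievers come into play.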

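Self-querying retrievers offer a powerful way to use metadata for more precise and nuanced searches. Combined with the search and indexing capabilities of Elasticsearch, self-querying becomes even more powerful, letting developers improve the relevance of their RAG applications. This blog post explores the concept of self-querying retrievers, demonstrates their integration with Elasticsearch using LangChain and Python, and shows how they can make your search more powerful.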

What are self-querying retrievers?

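Self-querying retrievers are a feature provided by LangChain that bridges the gap between natural language queries and structured metadata filtering. Instead of relying solely on keyword matching against document content, they use a large language model (LLM) together with Elasticsearch's vector search capabilities to parse the user's natural language query and intelligently extract the relevant metadata filters. For example, a user might ask "Find science fiction movies released after 2000 with a rating above 8". A traditional search engine struggles to find the implied meaning without exact keywords, and semantic search alone understands the context of the query but cannot apply the date and rating filters needed for the best answer. A self-querying retriever, however, analyzes the query, identifies the metadata fields (genre, year, rating), and generates a structured query that Elasticsearch can understand and execute efficiently. This gives a more intuitive, user-friendly search experience in which users can express complex search criteria, filters included, in plain English.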

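All of this happens through an LLM chain: the LLM parses the natural language query to extract filters, and the resulting structured filters are then applied to the documents in Elasticsearch, which hold both embeddings and metadata.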
Source: LangChain
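
To make the translation step more concrete, here is a minimal sketch of how extracted comparisons could be turned into an Elasticsearch bool filter. The helper to_es_filter, the (field, op, value) tuple format, and the "metadata." field prefix are assumptions made for illustration. This is not LangChain's actual translator, whose output may differ in detail.

```python
# Hypothetical sketch of the "structured filter" step; not LangChain's translator.
from typing import Any

# Comparators that map directly onto Elasticsearch range operators.
RANGE_OPS = {"gt", "gte", "lt", "lte"}


def to_es_filter(comparisons: list[tuple[str, str, Any]]) -> dict:
    """Combine (field, op, value) comparisons into one bool filter (logical AND)."""
    clauses = []
    for field, op, value in comparisons:
        # Assumption: metadata is stored under a "metadata" object in each document.
        # For analyzed text fields, a ".keyword" subfield is typically needed for term queries.
        path = f"metadata.{field}"
        if op == "eq":
            clauses.append({"term": {path: value}})
        elif op in RANGE_OPS:
            clauses.append({"range": {path: {op: value}}})
        else:
            raise ValueError(f"unsupported comparator: {op}")
    return {"bool": {"filter": clauses}}


# Filters an LLM might extract from:
# "Find science fiction movies released after 2000 with a rating above 8"
extracted = [("genre", "eq", "science fiction"), ("year", "gt", 2000), ("rating", "gt", 8)]
es_filter = to_es_filter(extracted)
print(es_filter)
```

In a real deployment the filter is combined with a vector (kNN) search over the embeddings, so the semantic part of the question still drives ranking while the filter narrows the candidate set.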

Implementing self-querying retrievers

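Integrating self-querying retrievers with Elasticsearch involves a few key steps. In our Python example, we use LangChain's AzureChatOpenAI and AzureOpenAIEmbeddings together with ElasticsearchStore. We start by importing the LangChain libraries and setting up the LLM and the embedding model used to create vectors: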


from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI
from langchain_elasticsearch import ElasticsearchStore
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.docstore.document import Document
import os

llm = AzureChatOpenAI(
   azure_endpoint=os.environ["AZURE_ENDPOINT"],
   deployment_name=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
   model_name="gpt-4",
   api_version="2024-02-15-preview"
)


embeddings = AzureOpenAIEmbeddings(
   azure_endpoint=os.environ["AZURE_ENDPOINT"],
   model="text-embedding-ada-002"
)

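In my example, I use Azure OpenAI for the LLM (gpt-4) and text-embedding-ada-002 for the embeddings. However, this should work with any cloud-based LLM as well as local ones such as Llama 3. The same goes for the embedding model; I used OpenAI's simply for convenience, since we are already using gpt-4.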

We then define our documents with their metadata, so that they can be indexed into Elasticsearch with the established metadata fields:


# --- Define Metadata Attributes ---
metadata_field_info = [
   AttributeInfo(
       name="year",
       description="The year the movie was released",
       type="integer",
   ),
   AttributeInfo(
       name="rating",
       description="The rating of the movie (out of 10)",
       type="float",
   ),
   AttributeInfo(
       name="genre",
       description="The genre of the movie",
       type="string",
   ),
   AttributeInfo(
       name="director",
       description="The director of the movie",
       type="string",
   ),
   AttributeInfo(
       name="title",
       description="The title of the movie",
       type="string",
   )
]

docs = [
   Document(
       page_content="Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.",
       metadata={"year": 2012, "rating": 7.7, "genre": "science fiction", "title": "Prometheus"},
   ),
   # ... more documents
]

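Next, we add them to an Elasticsearch index. The es_store.add_embeddings function adds the documents to the index named in the ELASTIC_INDEX_NAME variable; if no index with that name is found in the cluster, one is created. In my example I use an Elastic Cloud deployment, but this works with a self-managed cluster as well: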


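# Split the documents into raw texts and metadata, then embed the texts
# (assumed preparation step; the ELASTIC_* settings are defined elsewhere).
texts = [doc.page_content for doc in docs]
metadatas = [doc.metadata for doc in docs]
doc_embeddings = embeddings.embed_documents(texts)

es_store = ElasticsearchStore(
   es_cloud_id=ELASTIC_CLOUD_ID,
   es_user=ELASTIC_USERNAME,
   es_password=ELASTIC_PASSWORD,
   index_name=ELASTIC_INDEX_NAME,
   embedding=embeddings,
)
# Index the (text, vector) pairs together with their metadata.
es_store.add_embeddings(text_embeddings=list(zip(texts, doc_embeddings)), metadatas=metadatas)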

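We then create the self-querying retriever, which takes the user's query, interprets it with the LLM (the Azure OpenAI model we set up earlier), and builds an Elasticsearch query that combines semantic search with metadata filters. All of this is executed by docs = retriever.invoke(query):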


# --- Create the self-querying Retriever (Using your LLM) ---
retriever = SelfQueryRetriever.from_llm(
   llm,
   es_store,
   "Search for movies",
   metadata_field_info,
   verbose=True,
)

while True:
   # Prompt the user for a query
   query = input("\nEnter your search query (or type 'exit' to quit): ")
   if query.lower() == 'exit':
       break
  
   # Execute the query and print the results
   print(f"\nQuery: {query}")
   docs = retriever.invoke(query)
   print(f"Found {len(docs)} documents:")
   for doc in docs:
       print(doc.page_content)
       print(doc.metadata)
       print("-" * 20)

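And there we have it! The query is executed against the Elasticsearch index, returning the documents that best match both the content and the metadata criteria. This lets users search in natural language, as in the example below: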


Query: What is a highly rated movie from the 1970s?
Found 3 documents:
The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.
{'year': 1972, 'rating': 9.2, 'genre': 'crime', 'title': 'The Godfather'}
--------------------
Three men walk into the Zone, three men walk out of the Zone
{'year': 1979, 'director': 'Andrei Tarkovsky', 'genre': 'thriller', 'rating': 9.9, 'title': 'Stalker'}
--------------------
Four armed men hijack a New York City subway car and demand a ransom for the passengers
{'year': 1974, 'rating': 7.6, 'director': 'Joseph Sargent', 'genre': 'action', 'title': 'The Taking of Pelham One Two Three'}
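
To see why these three documents qualify, the metadata part of the query can be reproduced locally. The sketch below assumes the LLM turned "from the 1970s" into a year range of 1970 to 1979. The filter it actually generated is not shown in the output above, and "highly rated" may have been handled by the semantic part of the search or by a separate rating filter. The in_decade helper is hypothetical.

```python
# Metadata copied from the sample document and the results shown in this post.
movies = [
    {"year": 2012, "rating": 7.7, "genre": "science fiction", "title": "Prometheus"},
    {"year": 1972, "rating": 9.2, "genre": "crime", "title": "The Godfather"},
    {"year": 1979, "rating": 9.9, "genre": "thriller", "title": "Stalker"},
    {"year": 1974, "rating": 7.6, "genre": "action", "title": "The Taking of Pelham One Two Three"},
]


def in_decade(meta: dict, start: int) -> bool:
    """Return True if the movie was released in the decade beginning at `start`."""
    return start <= meta["year"] < start + 10


# Keep only the movies whose release year falls in the 1970s.
seventies = [m["title"] for m in movies if in_decade(m, 1970)]
print(seventies)
```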

Conclusion

While self-querying retrievers offer significant advantages, it is important to consider their limitations:

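Self-querying retrievers depend on the accuracy of the LLM in interpreting the user's query and extracting the correct metadata filters. If the query is too ambiguous or the metadata is poorly defined, the retriever may produce incorrect or incomplete results.
The performance of the self-querying process depends on the complexity of the query and the size of the dataset. For extremely large datasets or very complex queries, LLM processing and query construction can add some overhead.
The cost of using an LLM for query interpretation should be considered, especially for high-traffic applications.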
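
One possible way to soften the cost and latency concerns is to cache the interpretation of repeated queries so the LLM is only called once per distinct question. The following is a minimal sketch under that assumption; parse_query_with_llm is a hypothetical stand-in for the real LLM call, and the normalization rule is only an example.

```python
from functools import lru_cache

llm_calls = 0  # counts how often the (simulated) LLM is invoked


def parse_query_with_llm(query: str) -> str:
    """Hypothetical stand-in for the LLM call that extracts filters from a query."""
    global llm_calls
    llm_calls += 1
    return f"structured({query})"


@lru_cache(maxsize=1024)
def cached_parse(normalized_query: str) -> str:
    # Only cache misses reach the (simulated) LLM.
    return parse_query_with_llm(normalized_query)


def parse(query: str) -> str:
    # Normalize case and whitespace so trivially different queries share a cache entry.
    return cached_parse(" ".join(query.lower().split()))


parse("Sci-fi movies after 2000")
parse("  sci-fi   movies after 2000 ")
print(llm_calls)
```

A cache like this only helps when users repeat the same questions, and cached interpretations should be invalidated if the metadata schema changes.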

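Despite these considerations, self-querying retrievers are a powerful enhancement to information retrieval, especially when combined with the scalability and power of Elasticsearch, and offer a compelling solution for building AI search applications.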

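Interested in trying it yourself? Start a free cloud trial and check out the sample code here. Interested in other retrievers you can use in the Elastic Stack, such as Reciprocal Rank Fusion? See the documentation here to learn more.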

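Elasticsearch has native integrations with industry-leading Gen AI tools and providers. Check out our webinars on going beyond RAG basics, or on building production-ready apps with the Elastic Vector Database.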

To build the best search solutions for your use case, start a free cloud trial or try Elastic on your local machine now.

Original: Unleashing your metadata: Self-querying retrievers with Elasticsearch - Elasticsearch Labs