How to Build a RAG System with ElasticsearchRetriever

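Elasticsearch is known for fast text retrieval and is a popular choice for building document-based knowledge bases.

This article tries building a RAG knowledge base system with ElasticsearchRetriever on top of langchain.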

1 elasticsearch

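1.1 elasticsearch

elasticsearch is a distributed, RESTful search and analytics engine. It provides a multitenant-capable distributed full-text search engine with an HTTP interface and schema-free JSON document storage, and it supports keyword search, vector search, hybrid search, and complex filtering.

Here elasticsearch is installed with docker, using the commands below.

Pull the ES image and create a docker network for ES (the elasticsearch-wolfi image is an optional hardened variant; the run command below uses the standard image)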

docker pull docker.elastic.co/elasticsearch/elasticsearch-wolfi:9.1.3

docker network create elastic

docker pull docker.elastic.co/elasticsearch/elasticsearch:9.1.3

Start ES

docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:9.1.3

For the detailed procedure, see

https://blog.csdn.net/liliang199/article/details/151581138

1.2 ElasticsearchRetriever

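ElasticsearchRetriever is a general-purpose wrapper provided by langchain that gives flexible access to all Elasticsearch features through the Query DSL. For most use cases the other classes (such as ElasticsearchStore and ElasticsearchEmbeddings) are sufficient; ElasticsearchRetriever is the option to reach for when they do not cover a specific requirement.

For a detailed description of all ElasticsearchRetriever features and configuration options, see the documentation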

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/
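The retriever is driven by a body function: a plain Python callable that receives the query string and returns a Query DSL request body as a dict. The snippet below is a minimal sketch of that idea only; the field name `text` and the `size` value are assumptions chosen for illustration, and nothing here talks to a real cluster.

```python
import json
from typing import Any, Dict

TEXT_FIELD = "text"  # assumed field name, for illustration only


def match_body(search_query: str) -> Dict[str, Any]:
    """Turn a user question into an Elasticsearch Query DSL request body."""
    return {
        "query": {"match": {TEXT_FIELD: search_query}},
        "size": 3,  # cap the number of hits returned
    }


body = match_body("hello world")
print(json.dumps(body, indent=2))
```

Because the body is an ordinary dict, anything the search API accepts (queries, knn, filters) can be expressed this way, which is what makes this retriever more flexible than the higher-level classes.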

2 langchain

2.1 langchain

elasticsearch is integrated through langchain here, so a langchain environment is required. The install commands are shown below.

pip install -qU langchain-community langchain-elasticsearch -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install -qU langchain-openai -i https://pypi.tuna.tsinghua.edu.cn/simple

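2.2 LLM configuration

Following common langchain practice, an OpenAI-API-compatible deepseek-r1 model is used here; specifically, the Deepseek R1 model is called through a OneAPI relay.

For the setup procedure, see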

https://blog.csdn.net/liliang199/article/details/151393128

Example code:

import os
from langchain_openai import ChatOpenAI

os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx"  # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url"  # LLM endpoint URL

...

llm = ChatOpenAI(model="deepseek-r1")

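3 Testing and verification

This section first verifies the ES connection, data import, and queries, and then integrates ES into langchain through ElasticsearchRetriever to build a working RAG system.

3.1 ES connection

es_url is the ES deployment address, and passwd is the password of the elastic user in ES.

ssl_context is built from the CA certificate of the ES service; for how to obtain it, see the link below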

https://blog.csdn.net/liliang199/article/details/151586083

Because an OpenAI-API-compatible LLM is used, the token api_key and the model endpoint base_url also need to be configured.

Example code:

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx"  # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url"  # LLM endpoint URL

import ssl
from typing import Any, Dict, Iterable

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from langchain_community.embeddings import DeterministicFakeEmbedding
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_elasticsearch import ElasticsearchRetriever

ssl_context = ssl.create_default_context(cafile='./http_ca.crt')  # ES CA certificate
es_url = "https://localhost:9200"  # ES deployment address
passwd = "the passwd of the user elastic"  # password of the elastic user

es_client = Elasticsearch(hosts=[es_url], basic_auth=('elastic', passwd), ssl_context=ssl_context)
es_client.info()

The output below indicates that the ES connection succeeded.

ObjectApiResponse({'name': 'xxxx', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'xxxxxx', ..., 'tagline': 'You Know, for Search'})

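3.2 Data import

Seven made-up test records are used here: texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]

A vector index is created for these 7 records. The vectors are pseudo-random, generated by DeterministicFakeEmbedding; in a real environment this can be replaced with, for example, OllamaEmbeddings.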
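To make the idea of a deterministic fake embedding concrete, here is a minimal standard-library sketch: the text is hashed into a seed, and the seed drives a random generator, so the same text always yields the same vector. This is an illustration under that assumption, not the library's actual implementation; `fake_embed` is a hypothetical helper name.

```python
import hashlib
import random
from typing import List


def fake_embed(text: str, size: int = 3) -> List[float]:
    """Return a pseudo-random vector that depends only on the input text."""
    # Derive a stable seed from the text so repeated calls agree.
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % 10**8
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


print(fake_embed("foo") == fake_embed("foo"))  # same text gives the same vector
print(len(fake_embed("bar", size=5)))  # vector length follows the size argument
```

Such vectors carry no semantic meaning, so they are only useful for testing the indexing and query plumbing, which is how they are used in this article.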

The implementation is shown below; the process is similar to batch vectorization followed by import into ES.

embeddings = DeterministicFakeEmbedding(size=3)
index_name = "test-langchain-retriever_v0"
text_field = "text"
dense_vector_field = "fake_embedding"
num_characters_field = "num_characters"
texts = [
    "foo",
    "bar",
    "world",
    "hello world",
    "hello",
    "foo bar",
    "bla bla foo",
]

def create_index(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    num_characters_field: str,
):
    es_client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                text_field: {"type": "text"},
                dense_vector_field: {"type": "dense_vector"},
                num_characters_field: {"type": "integer"},
            }
        },
    )


def index_data(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    embeddings: Embeddings,
    texts: Iterable[str],
    refresh: bool = True,
) -> int:
    create_index(
        es_client, index_name, text_field, dense_vector_field, num_characters_field
    )

    texts = list(texts)  # materialize once so a generator is not exhausted before zip()
    vectors = embeddings.embed_documents(texts)
    requests = [
        {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,
            text_field: text,
            dense_vector_field: vector,
            num_characters_field: len(text),
        }
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]

    bulk(es_client, requests)

    if refresh:
        es_client.indices.refresh(index=index_name)

    return len(requests)


index_data(es_client, index_name, text_field, dense_vector_field, embeddings, texts)

Under normal circumstances, the output is as follows.

7
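The bulk helper consumes one action dict per document. The sketch below isolates that step as a generator so the shape of each action can be inspected without a running cluster; `build_actions`, the index name `demo-index`, and the sample vectors are made up for illustration, while the field names mirror the ones used above.

```python
from typing import Dict, Iterable, Iterator, List


def build_actions(
    index_name: str, texts: Iterable[str], vectors: Iterable[List[float]]
) -> Iterator[Dict]:
    """Yield one bulk "index" action per (text, vector) pair."""
    for i, (text, vector) in enumerate(zip(texts, vectors)):
        yield {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,
            "text": text,
            "fake_embedding": vector,
            "num_characters": len(text),
        }


sample_texts = ["foo", "hello world"]
sample_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # made-up vectors
actions = list(build_actions("demo-index", sample_texts, sample_vectors))
print(len(actions), actions[1]["num_characters"])  # 2 11
```

Since `elasticsearch.helpers.bulk` accepts any iterable of actions, a generator like this could be passed to it directly, which avoids holding every action in memory for larger corpora.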

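3.3 Data query

ES supports different ways of querying an index, such as vector queries, BM25 queries, and hybrid queries. Each query type needs its own ElasticsearchRetriever.

1) Vector query

The vector query code is shown below. It defines the query function vector_query and builds vector_retriever.

Inside the query function, the question search_query is first embedded, and the resulting vector is passed as query_vector.

def vector_query(search_query: str) -> Dict: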
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "knn": {
            "field": dense_vector_field,
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10,
        }
    }


vector_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=vector_query,
    content_field=text_field,
    es_client=es_client
)

vector_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9987202, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), ...
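To show what a knn query ranks by, here is a brute-force sketch in plain Python. It assumes cosine similarity and the score mapping (1 + cosine) / 2 that the Elasticsearch documentation describes for cosine on dense_vector fields. The real engine searches approximately (only num_candidates vectors per shard are considered), so this exact version is an illustration rather than a reproduction; the document ids and vectors are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def exact_knn(
    query: List[float], docs: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k best."""
    scored = [(doc_id, (1.0 + cosine(query, vec)) / 2.0) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


docs = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [-1.0, 0.0, 0.0]}
top = exact_knn([1.0, 0.0, 0.0], docs, k=2)
print(top)  # [('a', 1.0), ('b', 0.5)]
```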

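2) BM25 query

BM25 is the relevance-ranking function behind traditional keyword matching, and it is the default scoring used by the match query. The code is shown below.

def bm25_query(search_query: str) -> Dict: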
    return {
        "query": {
            "match": {
                text_field: search_query,
            },
        },
    }


bm25_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=bm25_query,
    content_field=text_field,
    es_client=es_client
)

bm25_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')

3) Hybrid query

A hybrid query defines both a standard text-match query and a knn vector query in the query function, and fuses their results with Reciprocal Rank Fusion (RRF).

def hybrid_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "match": {
                                    text_field: search_query,
                                }
                            }
                        }
                    },
                    {
                        "knn": {
                            "field": dense_vector_field,
                            "query_vector": vector,
                            "k": 5,
                            "num_candidates": 10,
                        }
                    },
                ]
            }
        }
    }


hybrid_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=hybrid_query,  # use the hybrid body function, not bm25_query
    content_field=text_field,
    es_client=es_client
)

hybrid_retriever.invoke("foo")

Sample output is shown below. It is identical to the BM25 output above, which indicates that it was captured with body_func=bm25_query; with hybrid_query, the RRF-fused scores and ordering will generally differ.

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
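Reciprocal Rank Fusion itself is simple enough to sketch in a few lines: each document receives 1 / (rank_constant + rank) from every ranking it appears in, and the contributions are summed. The snippet below assumes rank_constant = 60, which the Elasticsearch RRF documentation gives as the default. The BM25 ranking reuses the ids from the BM25 output above, while the knn ranking is hypothetical and only there to make the fusion visible.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], rank_constant: int = 60) -> Dict[str, float]:
    """Fuse several ranked id lists into one score per document id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


bm25_ranking = ["0", "5", "6"]  # ids in the order of the BM25 output above
knn_ranking = ["0", "3", "5", "1", "2"]  # hypothetical vector ranking
fused = rrf([bm25_ranking, knn_ranking])
print(list(fused))  # ['0', '5', '3', '6', '1', '2']
```

Because only ranks enter the formula, RRF needs no score normalization between BM25 and vector similarity, and the fused scores are small numbers (about 0.03 at most for two retrievers) rather than BM25-sized values.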

4) Fuzzy query

The example below performs typo-tolerant string matching.

def fuzzy_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: {
                    "query": search_query,
                    "fuzziness": "AUTO",
                }
            },
        },
    }


fuzzy_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=fuzzy_query,
    content_field=text_field,
    es_client=es_client
)

fuzzy_retriever.invoke("fox")  # note the character tolerance

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.6474311, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.49580228, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.40171927, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
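Why "fox" still finds "foo" can be illustrated with edit distance. The sketch below uses plain Levenshtein distance and the length thresholds documented for "fuzziness": "AUTO" (terms of length 0-2 must match exactly, 3-5 allow one edit, longer terms allow two). Elasticsearch also counts a transposition of adjacent characters as a single edit by default, which this simplified version does not model.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def auto_fuzziness(term: str) -> int:
    """Number of edits allowed for a term under the AUTO setting."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


query = "fox"
for term in ["foo", "bar"]:
    distance = edit_distance(query, term)
    print(term, distance, distance <= auto_fuzziness(query))
# foo 1 True
# bar 3 False
```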

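5) Complex filtering

A bool query combines several clause types, such as must, must_not, and should, to narrow the results and improve query efficiency.

Example code:

def filter_query_func(search_query: str) -> Dict: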
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {num_characters_field: {"gte": 5}}},
                ],
                "must_not": [
                    {"prefix": {text_field: "bla"}},
                ],
                "should": [
                    {"match": {text_field: search_query}},
                ],
            }
        }
    }


filtering_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    content_field=text_field,
    es_client=es_client
)

filtering_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 1.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '2', '_score': 1.0, '_source': {'fake_embedding': [-0.7041151202179595, -1.4652961969276497, -0.25786766898672847], 'num_characters': 5}}, page_content='world'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '3', '_score': 1.0, '_source': {'fake_embedding': [0.42728413221815387, -1.1889908285425348, -1.445433230084671], 'num_characters': 11}}, page_content='hello world'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '4', '_score': 1.0, '_source': {'fake_embedding': [-0.28560441330564046, 0.9958894823084921, 1.5489829880195058], 'num_characters': 5}}, page_content='hello')
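The result set above can be reasoned through by hand. The following sketch replays the three clauses in plain Python over the seven test strings: must keeps documents with at least 5 characters, must_not drops any document containing a token that starts with "bla", and should is optional here (a must clause is present) so it only lifts matching documents to the top. Whitespace splitting stands in for the real analyzer, and `evaluate` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]


def evaluate(text: str, search_query: str) -> Optional[bool]:
    """Return None if the document is excluded, else whether "should" matched."""
    tokens = text.lower().split()  # crude stand-in for the analyzer
    if len(text) < 5:  # must: num_characters >= 5
        return None
    if any(token.startswith("bla") for token in tokens):  # must_not: prefix "bla"
        return None
    query_tokens = search_query.lower().split()
    return any(token in tokens for token in query_tokens)  # should: optional match


results: List[Tuple[str, bool]] = []
for text in texts:
    should_matched = evaluate(text, "foo")
    if should_matched is not None:
        results.append((text, should_matched))

# Documents that also satisfy "should" come first; the sort is stable otherwise.
results.sort(key=lambda pair: not pair[1])
print(results)
# [('foo bar', True), ('world', False), ('hello world', False), ('hello', False)]
```

This matches the four documents returned above: "foo bar" scores higher because it also satisfies the should clause, while the other three pass only the range filter.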

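6) Document mapping

The ES hits can be mapped to langchain Documents with a custom function. The query function is still the complex filter filter_query_func; an example mapping function, num_characters_mapper, is shown below and can be customized as needed.

def num_characters_mapper(hit: Dict[str, Any]) -> Document: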
    num_chars = hit["_source"][num_characters_field]
    content = hit["_source"][text_field]
    return Document(
        page_content=f"This document has {num_chars} characters",
        metadata={"text_content": content},
    )


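# Reuse the authenticated es_client from section 3.1; from_es_params(url=es_url)
# alone would not pass the elastic password or the CA certificate.
custom_mapped_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    document_mapper=num_characters_mapper,
    es_client=es_client,
)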

custom_mapped_retriever.invoke("foo")

Sample output is as follows

Document(metadata={'text_content': 'foo bar'}, page_content='This document has 7 characters'), Document(metadata={'text_content': 'world'}, page_content='This document has 5 characters'), Document(metadata={'text_content': 'hello world'}, page_content='This document has 11 characters'), Document(metadata={'text_content': 'hello'}, page_content='This document has 5 characters')

3.4 langchain

Building on the ElasticsearchRetriever verified above, combined with the custom ChatOpenAI LLM, a complete langchain RAG system is built here. The chain is defined as follows

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The complete code example:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)

llm = ChatOpenAI(model="deepseek-r1")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

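Then invoke the chain to run a complete, end-to-end RAG knowledge base retrieval.

chain.invoke("what is foo?")

The output is shown below. Beyond retrieval, the LLM has processed the retrieved results into a coherent answer.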

'Based on the provided context, "foo" appears in two instances:\n1. In the line: "bla bla foo"\n2. In the line: "foo bar"\n\nThe context does not explicitly define what "foo" is, but it is used as part of example text alongside other placeholder terms like "bla," "bar," "hello," and "world." No further explanation or meaning is given for "foo" in the context.'
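To make the data flow through the chain explicit, here is a dependency-free sketch of the same steps: retrieve, format the documents into a context string, fill the prompt, and call the model. Both `fake_retriever` and `fake_llm` are stand-ins invented for illustration (a token-overlap lookup and an echo stub); they replace the real retriever and LLM only so the wiring can be run anywhere.

```python
from typing import List

PROMPT = (
    "Answer the question based only on the context provided.\n\n"
    "Context: {context}\n\n"
    "Question: {question}"
)

CORPUS = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]


def fake_retriever(question: str) -> List[str]:
    """Stand-in retriever: return texts sharing a token with the question."""
    words = question.lower().replace("?", "").split()
    return [text for text in CORPUS if any(w in text.split() for w in words)]


def format_docs(docs: List[str]) -> str:
    """Join retrieved texts into one context block, as in the chain above."""
    return "\n\n".join(docs)


def build_prompt(question: str) -> str:
    """Retrieve, format, and fill the prompt template."""
    context = format_docs(fake_retriever(question))
    return PROMPT.format(context=context, question=question)


def fake_llm(prompt_text: str) -> str:
    """Stand-in model: report the prompt size instead of generating text."""
    return f"[stub answer for a prompt of {len(prompt_text)} characters]"


prompt_text = build_prompt("what is foo?")
print(prompt_text)
print(fake_llm(prompt_text))
```

The question is passed through unchanged into the prompt (the role RunnablePassthrough plays in the chain), while the context slot is filled by the retriever piped into format_docs.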

References

ElasticsearchRetriever

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

Installing Kibana + ElasticSearch locally with docker on Mac

https://blog.csdn.net/liliang199/article/details/151581138

Accessing a docker-based elasticsearch from python

https://blog.csdn.net/liliang199/article/details/151586083

ElasticsearchRetriever constructor parameters

https://python.langchain.com/api_reference/elasticsearch/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html

Reciprocal rank fusion

https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion

OneAPI: accessing all LLMs through the OpenAI API

https://blog.csdn.net/liliang199/article/details/151393128

ChatOpenAI

https://python.langchain.com/api_reference/community/chat_models/langchain_community.chat_models.openai.ChatOpenAI.html#langchain_community.chat_models.openai.ChatOpenAI
