How to Build a RAG System with ElasticsearchRetriever

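Elasticsearch is known for fast text retrieval and is a leading choice for building document-oriented knowledge bases.

This post builds a RAG knowledge base system with LangChain and ElasticsearchRetriever.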

1 elasticsearch

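1.1 elasticsearch

elasticsearch is a distributed, RESTful search and analytics engine. It provides a multi-tenant-capable distributed full-text search engine with an HTTP web interface and schema-free JSON document storage, and it supports keyword search, vector search, hybrid search, and complex filtering.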

Here elasticsearch is installed with Docker, using the commands below.

Pull the ES image and create a Docker network for ES:

docker pull docker.elastic.co/elasticsearch/elasticsearch-wolfi:9.1.3

docker network create elastic

docker pull docker.elastic.co/elasticsearch/elasticsearch:9.1.3

Start ES:

docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:9.1.3

For the full procedure, see

https://blog.csdn.net/liliang199/article/details/151581138

1.2 ElasticsearchRetriever

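ElasticsearchRetriever is a generic LangChain wrapper that gives flexible access to all Elasticsearch features through the Query DSL. For most use cases the other classes (such as ElasticsearchStore and ElasticsearchEmbeddings) are sufficient; ElasticsearchRetriever is the option for special requirements they do not cover.

For a detailed description of all ElasticsearchRetriever features and configuration, see the documentation: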

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

2 langchain

2.1 langchain

elasticsearch is integrated through LangChain here, so a LangChain environment is required. The install commands are as follows.

pip install -qU langchain-community langchain-elasticsearch -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install -qU langchain-openai -i https://pypi.tuna.tsinghua.edu.cn/simple

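2.2 LLM configuration

Following common LangChain practice, this post uses the deepseek-r1 model through an OpenAI-compatible interface; specifically, DeepSeek R1 is called through a OneAPI relay.

For the setup procedure, see

https://blog.csdn.net/liliang199/article/details/151393128

Sample code: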

import os
from langchain_openai import ChatOpenAI

os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx" # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url" # LLM deployment address

...

llm = ChatOpenAI(model="deepseek-r1")

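3 Testing and verification

This section first verifies the ES connection, data import, and queries, then integrates ES into LangChain through ElasticsearchRetriever to build a working RAG system.

3.1 Connecting to ES

es_url is the address where ES is deployed, and passwd is the password of the elastic user in ES.

ssl_context holds the certificate that ES serves with; see the link below for how to obtain it.

https://blog.csdn.net/liliang199/article/details/151586083

Because the LLM is accessed through an OpenAI-compatible interface, the token api_key and the model deployment address base_url must also be configured.

Sample code: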

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx" # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url" # LLM deployment address

import ssl
from typing import Any, Dict, Iterable

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from langchain_community.embeddings import DeterministicFakeEmbedding
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_elasticsearch import ElasticsearchRetriever

ssl_context = ssl.create_default_context(cafile='./http_ca.crt') # ES CA certificate
es_url = "https://localhost:9200" # ES deployment address
passwd = "the passwd of the user elastic" # password of the elastic user

es_client = Elasticsearch(hosts=[es_url],  basic_auth=('elastic', passwd), ssl_context=ssl_context)
es_client.info()

Output like the following indicates that the ES connection succeeded.

ObjectApiResponse({'name': 'xxxx', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'xxxxxx', ..., 'tagline': 'You Know, for Search'})

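3.2 Data import

Seven made-up test records are used here: texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]

A vector index is created for these 7 records. The vectors are pseudo-random, produced by DeterministicFakeEmbedding; in a real environment this can be replaced with a real embedding model such as OllamaEmbeddings.

The implementation is shown below. The procedure is essentially batch vectorization followed by a bulk import into ES.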

embeddings = DeterministicFakeEmbedding(size=3)
index_name = "test-langchain-retriever_v0"
text_field = "text"
dense_vector_field = "fake_embedding"
num_characters_field = "num_characters"
texts = [
    "foo",
    "bar",
    "world",
    "hello world",
    "hello",
    "foo bar",
    "bla bla foo",
]

def create_index(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    num_characters_field: str,
):
    es_client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                text_field: {"type": "text"},
                dense_vector_field: {"type": "dense_vector"},
                num_characters_field: {"type": "integer"},
            }
        },
    )


def index_data(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    embeddings: Embeddings,
    texts: Iterable[str],
    refresh: bool = True,
) -> int:
    create_index(
        es_client, index_name, text_field, dense_vector_field, num_characters_field
    )

    vectors = embeddings.embed_documents(list(texts))
    requests = [
        {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,
            text_field: text,
            dense_vector_field: vector,
            num_characters_field: len(text),
        }
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]

    bulk(es_client, requests)

    if refresh:
        es_client.indices.refresh(index=index_name)

    return len(requests)


index_data(es_client, index_name, text_field, dense_vector_field, embeddings, texts)

Under normal conditions, the output is as follows.

7
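To illustrate why a deterministic fake embedding is convenient for tests, here is a minimal pure-Python sketch of the idea: seed a random generator from a hash of the text, so the same text always yields the same vector. The helper name `fake_embed` and the hashing scheme are assumptions made for illustration; this is not the LangChain implementation.

```python
import hashlib
import random
from typing import List


def fake_embed(text: str, size: int = 3) -> List[float]:
    """Return a pseudo-random vector that depends only on `text` (hypothetical helper)."""
    # Derive a stable integer seed from the text. Python's built-in hash() is
    # salted per process, so a cryptographic digest is used instead.
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % 10**8
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# The same text always maps to the same vector; different texts almost surely differ.
vectors = {text: fake_embed(text) for text in ["foo", "bar", "hello world"]}
```

Vectors built this way carry no semantic meaning, so apart from an exact text match the nearest neighbours of a query are arbitrary. That is acceptable for checking the plumbing, but not for judging retrieval quality.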

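3.3 Data queries

ES supports several query styles, such as vector (kNN) search, BM25 search, and hybrid search. Each style needs an ElasticsearchRetriever built with a matching query function.

1) Vector query

The vector query code is shown below. It defines the query function vector_query and builds vector_retriever from it.

Inside the query function, the question search_query is first embedded, and the resulting vector is passed as query_vector.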

def vector_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "knn": {
            "field": dense_vector_field,
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10,
        }
    }


vector_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=vector_query,
    content_field=text_field,
    es_client=es_client
)

vector_retriever.invoke("foo")

The output is as follows:

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9987202, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), ...
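As a rough picture of what a kNN query computes, the sketch below does an exact brute-force nearest-neighbour search in pure Python. It assumes the index uses cosine similarity, for which Elasticsearch reports `_score = (1 + cosine) / 2`. The vectors are copied from the sample outputs in this post (the vector for "bar" never appears there, so it is left out), and the function names are hypothetical. Elasticsearch itself runs an approximate search, so its scores can deviate slightly from these exact values.

```python
import math
from typing import Dict, List, Tuple

# Vectors copied from the sample outputs in this post.
INDEX: Dict[str, List[float]] = {
    "foo": [-2.336764233933763, 0.27510289545940503, -0.7957597268194339],
    "world": [-0.7041151202179595, -1.4652961969276497, -0.25786766898672847],
    "hello world": [0.42728413221815387, -1.1889908285425348, -1.445433230084671],
    "hello": [-0.28560441330564046, 0.9958894823084921, 1.5489829880195058],
    "foo bar": [0.2533670476638539, 0.08100381646160418, 0.7763644080870179],
    "bla bla foo": [1.7365927060137358, -0.5230400847844948, 0.7978339724186192],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def knn(query_vector: List[float], k: int = 5) -> List[Tuple[str, float]]:
    """Score every document exactly and return the k best (text, score) pairs."""
    scored = [(text, (1.0 + cosine(query_vector, vec)) / 2.0) for text, vec in INDEX.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Querying with the vector of "foo" ranks "foo" itself first.
top = knn(INDEX["foo"], k=3)
```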

2) BM25 query

BM25 is the relevance scoring function behind the traditional full-text match query. The code is as follows.

def bm25_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: search_query,
            },
        },
    }


bm25_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=bm25_query,
    content_field=text_field,
    es_client=es_client
)

bm25_retriever.invoke("foo")

The output is as follows:

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
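The scores above can be approximated by hand. The sketch below is a pure-Python BM25 calculation over the seven test texts, under a few assumptions: a single shard, whitespace tokenization, and the usual defaults k1 = 1.2 and b = 0.75. The function name `bm25_scores` is hypothetical.

```python
import math
from typing import Dict, List

TEXTS = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]
K1, B = 1.2, 0.75  # assumed default BM25 parameters


def bm25_scores(term: str, texts: List[str]) -> Dict[str, float]:
    """Score every text that contains `term` with BM25 (single-term query)."""
    docs = [text.split() for text in texts]       # whitespace tokenization (assumption)
    avgdl = sum(len(tokens) for tokens in docs) / len(docs)
    n = sum(1 for tokens in docs if term in tokens)  # documents containing the term
    idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
    scores: Dict[str, float] = {}
    for text, tokens in zip(texts, docs):
        freq = tokens.count(term)
        if freq == 0:
            continue  # non-matching documents are not returned at all
        length_norm = K1 * (1 - B + B * len(tokens) / avgdl)
        scores[text] = idf * freq * (K1 + 1) / (freq + length_norm)
    return scores


scores = bm25_scores("foo", TEXTS)
```

Under these assumptions the three values agree with the `_score` fields in the sample output to about four decimal places. Shorter documents score higher because the same single occurrence of "foo" makes up a larger share of the text.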

3) Hybrid query

A hybrid query defines both a standard text match query and a kNN vector query in the query function, and merges their results with Reciprocal Rank Fusion (RRF).

def hybrid_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "match": {
                                    text_field: search_query,
                                }
                            }
                        }
                    },
                    {
                        "knn": {
                            "field": dense_vector_field,
                            "query_vector": vector,
                            "k": 5,
                            "num_candidates": 10,
                        }
                    },
                ]
            }
        }
    }


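hybrid_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=hybrid_query,
    content_field=text_field,
    es_client=es_client
)

hybrid_retriever.invoke("foo")

The sample output below is identical to the BM25 result above, so it reflects BM25 scoring; with hybrid_query as body_func, the RRF-fused scores and ordering will differ.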

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
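Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. Each document receives 1 / (rank_constant + rank) from every ranking it appears in, and the contributions are summed. The sketch assumes a rank constant of 60 and ranks starting at 1. The BM25 ranking uses the ids from the sample output, while the kNN ranking is a made-up order for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], rank_constant: int = 60) -> Dict[str, float]:
    """Fuse several ranked id lists into one score per id, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Python dicts keep insertion order, so the result iterates from best to worst.
    return dict(sorted(scores.items(), key=lambda pair: pair[1], reverse=True))


bm25_ranking = ["0", "5", "6"]           # ids in the order of the BM25 sample output
knn_ranking = ["0", "3", "5", "2", "4"]  # hypothetical kNN order, for illustration
fused = rrf([bm25_ranking, knn_ranking])
```

Because only ranks enter the formula, RRF needs no normalization between BM25 scores and vector similarities, which live on different scales. Documents found by both retrievers, here ids "0" and "5", rise to the top.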

4) Fuzzy query

The sample code is below; this is string matching with typo tolerance.

def fuzzy_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: {
                    "query": search_query,
                    "fuzziness": "AUTO",
                }
            },
        },
    }


fuzzy_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=fuzzy_query,
    content_field=text_field,
    es_client=es_client
)

fuzzy_retriever.invoke("fox")  # note the character tolerance

The output is as follows:

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.6474311, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.49580228, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.40171927, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
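Why "fox" finds "foo" can be sketched with an edit-distance check. With fuzziness set to AUTO, terms of length 0 to 2 must match exactly, terms of length 3 to 5 may differ by one edit, and longer terms by two. The sketch below uses an optimal-string-alignment distance (insert, delete, substitute, and swap of adjacent characters) and whitespace tokenization; the function names are hypothetical and it only decides which texts match, not their scores.

```python
from typing import List

TEXTS = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]


def auto_fuzziness(term: str) -> int:
    """Allowed number of edits for a term under fuzziness AUTO."""
    if len(term) <= 2:
        return 0
    return 1 if len(term) <= 5 else 2


def edit_distance(a: str, b: str) -> int:
    """Edit distance that also counts a swap of adjacent characters as one edit."""
    prev2: List[int] = []
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            best = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)  # adjacent transposition
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]


def fuzzy_match(query_term: str, indexed_term: str) -> bool:
    return edit_distance(query_term, indexed_term) <= auto_fuzziness(query_term)


# Texts with at least one token within the allowed distance of "fox".
matches = [t for t in TEXTS if any(fuzzy_match("fox", tok) for tok in t.split())]
```

The three texts selected here are the same three that appear in the sample output: every one of them contains the token "foo", which is one substitution away from "fox".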

5) Complex filtering

Several clause types, such as must, must_not, and should, can be combined in a bool query to narrow the result set.

Sample code:

def filter_query_func(search_query: str) -> Dict:
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {num_characters_field: {"gte": 5}}},
                ],
                "must_not": [
                    {"prefix": {text_field: "bla"}},
                ],
                "should": [
                    {"match": {text_field: search_query}},
                ],
            }
        }
    }


filtering_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    content_field=text_field,
    es_client=es_client
)

filtering_retriever.invoke("foo")

The output is as follows:

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 1.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '2', '_score': 1.0, '_source': {'fake_embedding': [-0.7041151202179595, -1.4652961969276497, -0.25786766898672847], 'num_characters': 5}}, page_content='world'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '3', '_score': 1.0, '_source': {'fake_embedding': [0.42728413221815387, -1.1889908285425348, -1.445433230084671], 'num_characters': 11}}, page_content='hello world'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '4', '_score': 1.0, '_source': {'fake_embedding': [-0.28560441330564046, 0.9958894823084921, 1.5489829880195058], 'num_characters': 5}}, page_content='hello')
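The effect of the three clause types can be mimicked in plain Python over the seven test texts. This is a sketch of the selection logic only, with whitespace tokenization assumed and a hypothetical function name: must keeps texts with at least 5 characters, must_not drops texts with a token starting with "bla", and should does not filter anything but moves matching texts ahead of the rest.

```python
from typing import List

TEXTS = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]


def bool_filter(search_query: str, texts: List[str]) -> List[str]:
    """Emulate the must / must_not / should clauses of the bool query."""
    kept = []
    for text in texts:
        tokens = text.split()
        if len(text) < 5:                                # must: num_characters >= 5
            continue
        if any(tok.startswith("bla") for tok in tokens):  # must_not: prefix "bla"
            continue
        kept.append(text)
    # should: optional clause; matching texts rank first, the rest keep their order
    # (Python's sort is stable, also with reverse=True).
    kept.sort(key=lambda text: search_query in text.split(), reverse=True)
    return kept


result = bool_filter("foo", TEXTS)
```

This yields the same four texts in the same order as the sample output. In the real scores, each hit gets 1.0 from the range clause, and "foo bar" additionally gets its BM25 match score from the should clause, hence 1.7437035.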

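6) Document mapping

This maps ES query results into LangChain Document objects. The query function is still the complex filter filter_query_func; the mapping function num_characters_mapper below is an example and can be customized as needed.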

def num_characters_mapper(hit: Dict[str, Any]) -> Document:
    num_chars = hit["_source"][num_characters_field]
    content = hit["_source"][text_field]
    return Document(
        page_content=f"This document has {num_chars} characters",
        metadata={"text_content": content},
    )


# Reuse the authenticated es_client so that the credentials and CA certificate apply.
custom_mapped_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    document_mapper=num_characters_mapper,
    es_client=es_client,
)

custom_mapped_retriever.invoke("foo")

Sample output:

Document(metadata={'text_content': 'foo bar'}, page_content='This document has 7 characters'), Document(metadata={'text_content': 'world'}, page_content='This document has 5 characters'), Document(metadata={'text_content': 'hello world'}, page_content='This document has 11 characters'), Document(metadata={'text_content': 'hello'}, page_content='This document has 5 characters')
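A mapper is just a function from one raw ES hit (a dict) to a document object, so it can be developed and tested without a running cluster. The sketch below shows another possible mapper that keeps the id, score, and scalar fields but drops the embedding vector from the metadata. `SimpleDoc` is a hypothetical stand-in for the LangChain Document class so that the example runs with the standard library alone, and the sample hit is modeled on the outputs in this post.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDoc:
    """Stand-in for langchain_core.documents.Document (hypothetical, for illustration)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def compact_mapper(
    hit: Dict[str, Any],
    text_field: str = "text",
    vector_field: str = "fake_embedding",
) -> SimpleDoc:
    """Map a raw hit to a document whose metadata omits the embedding vector."""
    source = dict(hit["_source"])   # copy, so the original hit is not modified
    content = source.pop(text_field)
    source.pop(vector_field, None)  # large vectors are rarely useful downstream
    metadata = {"_id": hit["_id"], "_score": hit["_score"], **source}
    return SimpleDoc(page_content=content, metadata=metadata)


sample_hit = {
    "_index": "test-langchain-retriever_v0",
    "_id": "5",
    "_score": 1.7437035,
    "_source": {
        "text": "foo bar",
        "fake_embedding": [0.2533670476638539, 0.08100381646160418, 0.7763644080870179],
        "num_characters": 7,
    },
}
doc = compact_mapper(sample_hit)
```

With the real Document class substituted for `SimpleDoc`, a function of this shape could be passed as document_mapper in place of num_characters_mapper.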

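3.4 langchain

Building on the ElasticsearchRetriever verified above, combined with the custom LLM exposed through ChatOpenAI, this section builds a complete LangChain RAG system. The chain is defined as follows: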

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The complete code is as follows:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)

llm = ChatOpenAI(model="deepseek-r1")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Then invoke the chain to run a complete RAG retrieval against the knowledge base.

chain.invoke("what is foo?")

The output is shown below. The system not only retrieves documents; the LLM also turns the retrieved results into a coherent answer.

'Based on the provided context, "foo" appears in two instances:\n1. In the line: "bla bla foo"\n2. In the line: "foo bar"\n\nThe context does not explicitly define what "foo" is, but it is used as part of example text alongside other placeholder terms like "bla," "bar," "hello," and "world." No further explanation or meaning is given for "foo" in the context.'
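To see what the LLM actually receives, the prompt assembly step can be reproduced with plain string formatting. The sketch below mirrors the template and the join logic of format_docs, but works on plain strings instead of Document objects. The function names and the list of retrieved texts are assumptions for illustration, since the real list depends on what the retriever returns.

```python
from typing import List

TEMPLATE = """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""


def format_texts(page_contents: List[str]) -> str:
    """Join retrieved texts with blank lines, as format_docs does for documents."""
    return "\n\n".join(page_contents)


def build_prompt(question: str, retrieved: List[str]) -> str:
    """Fill the template with the retrieved context and the user question."""
    return TEMPLATE.format(context=format_texts(retrieved), question=question)


# Hypothetical retrieval result for the question used in this post.
prompt_text = build_prompt("what is foo?", ["foo", "foo bar", "bla bla foo"])
print(prompt_text)
```

Printing the assembled prompt like this is a quick way to debug a RAG chain: if the answer is poor, it shows whether the retrieved context or the model is at fault.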

reference


ElasticsearchRetriever

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

Installing Kibana + ElasticSearch locally with Docker on Mac

https://blog.csdn.net/liliang199/article/details/151581138

Accessing a Docker-based elasticsearch from Python

https://blog.csdn.net/liliang199/article/details/151586083

ElasticsearchRetriever constructor parameter reference

https://python.langchain.com/api_reference/elasticsearch/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html

Reciprocal rank fusion

https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion

OneAPI: accessing all LLMs through the OpenAI API

https://blog.csdn.net/liliang199/article/details/151393128

ChatOpenAI

https://python.langchain.com/api_reference/community/chat_models/langchain_community.chat_models.openai.ChatOpenAI.html#langchain_community.chat_models.openai.ChatOpenAI
