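How to build a RAG system with ElasticsearchRetriever

Elasticsearch is known for fast full-text retrieval and is a common choice for building document knowledge bases.

This article builds a RAG knowledge-base system with ElasticsearchRetriever on top of langchain.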

1 elasticsearch

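1.1 elasticsearch

elasticsearch is a distributed, RESTful search and analytics engine. It provides a multi-tenant distributed full-text search engine with an HTTP interface and schema-free JSON document storage, and supports keyword search, vector search, hybrid search, and complex filtering.

Here elasticsearch is installed with docker, using the commands below.

Pull the ES image and create a docker network for ES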

docker pull docker.elastic.co/elasticsearch/elasticsearch-wolfi:9.1.3

docker network create elastic

docker pull docker.elastic.co/elasticsearch/elasticsearch:9.1.3

Start ES

docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:9.1.3

For the detailed procedure, see

https://blog.csdn.net/liliang199/article/details/151581138

1.2 ElasticsearchRetriever

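ElasticsearchRetriever is a generic wrapper provided by langchain that gives flexible access to all Elasticsearch features through the Query DSL. For most use cases the other classes (such as ElasticsearchStore and ElasticsearchEmbeddings) are sufficient; ElasticsearchRetriever is the option for requirements they do not cover.

For a detailed description of all ElasticsearchRetriever features and configuration options, see the API reference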

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

2 langchain

2.1 langchain

elasticsearch is integrated through langchain here, so a langchain environment is required. The install commands are shown below.

pip install -qU langchain-community langchain-elasticsearch -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install -qU langchain-openai -i https://pypi.tuna.tsinghua.edu.cn/simple

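2.2 LLM configuration

Following the usual langchain conventions, this article uses the deepseek-r1 model through an OpenAI-compatible interface; specifically, Deepseek R1 is called through a OneAPI relay.

For the setup, see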

https://blog.csdn.net/liliang199/article/details/151393128

Sample code is shown below.

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx" # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url" # base URL of the LLM endpoint

...

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="deepseek-r1")

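3 Testing and verification

This section first verifies the ES connection, data import, and queries, then integrates ES into langchain through ElasticsearchRetriever to build a working RAG system.

3.1 ES connection

es_url is the address where ES is deployed, and passwd is the password of the elastic user in ES.

ssl_context holds the CA certificate used to verify the ES HTTPS endpoint; for how to obtain it, see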

https://blog.csdn.net/liliang199/article/details/151586083

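Because the LLM is accessed through an OpenAI-compatible interface, the token api_key and the model endpoint base_url also need to be configured.

Sample code is shown below.

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx" # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url" # base URL of the LLM endpoint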

import ssl
from typing import Any, Dict, Iterable

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from langchain_community.embeddings import DeterministicFakeEmbedding
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_elasticsearch import ElasticsearchRetriever

ssl_context = ssl.create_default_context(cafile='./http_ca.crt') # ES CA certificate
es_url = "https://localhost:9200" # ES address
passwd = "the passwd of the user elastic" # password of the elastic user

es_client = Elasticsearch(hosts=[es_url],  basic_auth=('elastic', passwd), ssl_context=ssl_context)
es_client.info()

The output below indicates that the ES connection succeeded.

ObjectApiResponse({'name': 'xxxx', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'xxxxxx', ..., 'tagline': 'You Know, for Search'})

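3.2 Data import

Seven made-up test records are used here: texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]

A vector index is created for these seven records. The vectors are pseudo-random, generated by DeterministicFakeEmbedding; in a real environment it can be replaced with an actual embedding model such as OllamaEmbeddings.

The implementation is shown below; the procedure is similar to batch vectorization followed by import into ES.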

embeddings = DeterministicFakeEmbedding(size=3)
index_name = "test-langchain-retriever_v0"
text_field = "text"
dense_vector_field = "fake_embedding"
num_characters_field = "num_characters"
texts = [
    "foo",
    "bar",
    "world",
    "hello world",
    "hello",
    "foo bar",
    "bla bla foo",
]

def create_index(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    num_characters_field: str,
):
    es_client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                text_field: {"type": "text"},
                dense_vector_field: {"type": "dense_vector"},
                num_characters_field: {"type": "integer"},
            }
        },
    )


def index_data(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    embeddings: Embeddings,
    texts: Iterable[str],
    refresh: bool = True,
) -> int:
    create_index(
        es_client, index_name, text_field, dense_vector_field, num_characters_field
    )

    vectors = embeddings.embed_documents(list(texts))
    requests = [
        {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,
            text_field: text,
            dense_vector_field: vector,
            num_characters_field: len(text),
        }
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]

    bulk(es_client, requests)

    if refresh:
        es_client.indices.refresh(index=index_name)

    return len(requests)


index_data(es_client, index_name, text_field, dense_vector_field, embeddings, texts)

If everything works, the output is as follows.

7
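To make the shape of the bulk request concrete, the sketch below rebuilds the list of actions that index_data hands to bulk, without contacting ES. fake_vector and build_actions are hypothetical helpers introduced only for illustration; fake_vector is a stand-in for DeterministicFakeEmbedding and does not reproduce its values. The field names are the ones defined above.

```python
import hashlib
from typing import Dict, List


def fake_vector(text: str, size: int = 3) -> List[float]:
    """Hypothetical stand-in for an embedding model: derives a
    deterministic pseudo-vector from the SHA-256 digest of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Scale each of the first `size` bytes (0..255) into [-1.0, 1.0].
    return [digest[i] / 127.5 - 1.0 for i in range(size)]


def build_actions(index_name: str, texts: List[str]) -> List[Dict]:
    """Build one bulk "index" action per text, mirroring index_data."""
    return [
        {
            "_op_type": "index",          # bulk operation type
            "_index": index_name,         # target index
            "_id": i,                     # document id = position in the list
            "text": text,
            "fake_embedding": fake_vector(text),
            "num_characters": len(text),
        }
        for i, text in enumerate(texts)
    ]


texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]
actions = build_actions("test-langchain-retriever_v0", texts)
print(len(actions))                  # 7
print(actions[3]["num_characters"])  # 11
```

Each action mixes bulk metadata keys (prefixed with an underscore) with the document fields themselves, which is the format the author's requests list uses.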

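3.3 Data queries

ES supports different ways of querying an index, such as vector queries, BM25 queries, and hybrid queries. Each query type needs a correspondingly constructed ElasticsearchRetriever.

1) Vector query

The vector query code is shown below. It defines the query function vector_query and builds vector_retriever from it.

Inside the query function, the question search_query is embedded and the resulting vector is passed as query_vector.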

def vector_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "knn": {
            "field": dense_vector_field,
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10,
        }
    }


vector_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=vector_query,
    content_field=text_field,
    es_client=es_client
)

vector_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9987202, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), ...
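The _score of a kNN hit is derived from vector similarity. As a rough sketch, assuming the dense_vector field uses cosine similarity (the mapping above does not set similarity, so the ES default applies), the score is (1 + cosine) / 2. The function names are illustrative. A score slightly below 1.0 for an identical vector, as in the listing above, is plausibly an effect of vector quantization in the index, which this sketch does not model.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_score(query: Sequence[float], doc: Sequence[float]) -> float:
    """Map cosine similarity from [-1, 1] to a score in [0, 1]."""
    return (1.0 + cosine(query, doc)) / 2.0


print(knn_score([1, 0], [0, 1]))   # 0.5  (orthogonal vectors)
print(knn_score([1, 0], [-1, 0]))  # 0.0  (opposite vectors)
```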

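2) BM25 query

A BM25 query is traditional term-based text matching, with hits ranked by the BM25 scoring function. The code is shown below.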

def bm25_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: search_query,
            },
        },
    }


bm25_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=bm25_query,
    content_field=text_field,
    es_client=es_client
)

bm25_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
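The scores in this listing can be reproduced by hand. The sketch below is a minimal BM25 calculation for a single query term, assuming whitespace tokenization, a single shard, and the parameters k1 = 1.2 and b = 0.75; the constant factor (k1 + 1) is included because it matches the scores listed above. bm25_scores is an illustrative helper, not part of any library.

```python
import math
from typing import List


def bm25_scores(term: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against a single query term with BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs       # average length in tokens
    n_match = sum(1 for t in tokenized if term in t)      # documents containing the term
    idf = math.log(1 + (n_docs - n_match + 0.5) / (n_match + 0.5))
    scores = []
    for tokens in tokenized:
        freq = tokens.count(term)
        # Term-frequency part with document-length normalization.
        tf = freq / (freq + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append((k1 + 1) * idf * tf)
    return scores


texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]
scores = bm25_scores("foo", texts)
print(round(scores[0], 4))  # 0.9711  ('foo')
print(round(scores[5], 4))  # 0.7437  ('foo bar')
print(round(scores[6], 4))  # 0.6026  ('bla bla foo')
```

The shorter the matching document relative to the average length, the higher its score, which is why 'foo' ranks above 'foo bar' and 'bla bla foo'.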

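3) Hybrid query

A hybrid query defines both a standard text-match query and a knn vector query in the query function, and merges the two result lists with Reciprocal Rank Fusion (RRF).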

def hybrid_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "match": {
                                    text_field: search_query,
                                }
                            }
                        }
                    },
                    {
                        "knn": {
                            "field": dense_vector_field,
                            "query_vector": vector,
                            "k": 5,
                            "num_candidates": 10,
                        }
                    },
                ]
            }
        }
    }


hybrid_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=hybrid_query,  # use the RRF query defined above, not bm25_query
    content_field=text_field,
    es_client=es_client
)

hybrid_retriever.invoke("foo")

Sample output is shown below. Note that this listing is identical to the BM25 output, BM25 scores included, which suggests it was recorded with body_func=bm25_query; with hybrid_query the _score values are rank-based RRF scores and the order may differ.

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
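For intuition about how RRF merges two result lists, here is a minimal sketch in plain Python. It assumes the usual RRF formula, where each document receives 1 / (rank_constant + rank) from every list it appears in, with rank_constant = 60. The rrf helper is illustrative, and the knn ranking used in the example is hypothetical, not taken from a real query.

```python
from collections import defaultdict
from typing import Dict, List


def rrf(rankings: List[List[str]], rank_constant: int = 60) -> Dict[str, float]:
    """Fuse several ranked lists of document ids into one score per id."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


bm25_ranking = ["0", "5", "6"]           # ids in the order of the BM25 listing
knn_ranking = ["0", "6", "2", "5", "4"]  # hypothetical knn order, for illustration
fused = rrf([bm25_ranking, knn_ranking])
print(list(fused))  # ['0', '6', '5', '2', '4']
```

Only ranks matter, not raw scores, so BM25 scores and vector similarities never need to be put on a common scale.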

4) Fuzzy query

Sample code is shown below; it is text matching with typo tolerance.

def fuzzy_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: {
                    "query": search_query,
                    "fuzziness": "AUTO",
                }
            },
        },
    }


fuzzy_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=fuzzy_query,
    content_field=text_field,
    es_client=es_client
)

fuzzy_retriever.invoke("fox")  # note the character tolerance

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.6474311, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.49580228, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.40171927, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
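Why "fox" matches documents containing "foo" can be sketched with edit distance. The code below assumes the default AUTO thresholds (terms of 1 to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two) and uses plain Levenshtein distance; ES also counts a transposition as a single edit, which this simplified sketch omits. All function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def auto_fuzziness(term: str) -> int:
    """Allowed edit distance under the assumed AUTO thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term: str, indexed_term: str) -> bool:
    return levenshtein(query_term, indexed_term) <= auto_fuzziness(query_term)


print(levenshtein("fox", "foo"))  # 1
print(fuzzy_match("fox", "foo"))  # True
print(fuzzy_match("fox", "bar"))  # False
```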

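5) Complex filtering

A bool query combines several clause types, such as must, must_not, and should, to narrow down the result set.

Sample code is shown below.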

def filter_query_func(search_query: str) -> Dict:
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {num_characters_field: {"gte": 5}}},
                ],
                "must_not": [
                    {"prefix": {text_field: "bla"}},
                ],
                "should": [
                    {"match": {text_field: search_query}},
                ],
            }
        }
    }


filtering_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    content_field=text_field,
    es_client=es_client
)

filtering_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 1.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '2', '_score': 1.0, '_source': {'fake_embedding': [-0.7041151202179595, -1.4652961969276497, -0.25786766898672847], 'num_characters': 5}}, page_content='world'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '3', '_score': 1.0, '_source': {'fake_embedding': [0.42728413221815387, -1.1889908285425348, -1.445433230084671], 'num_characters': 11}}, page_content='hello world'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '4', '_score': 1.0, '_source': {'fake_embedding': [-0.28560441330564046, 0.9958894823084921, 1.5489829880195058], 'num_characters': 5}}, page_content='hello')
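The result above can be checked by applying the three clauses by hand. The sketch below emulates the bool query in plain Python, assuming whitespace tokenization: must keeps documents with at least 5 characters, must_not drops documents containing a term that starts with "bla", and should is optional (a must clause is present) and only lifts matching documents in the ranking. bool_filter is an illustrative helper and does not compute ES scores.

```python
from typing import List, Tuple

texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]


def bool_filter(docs: List[str], query_term: str) -> List[Tuple[int, str]]:
    """Emulate must / must_not / should from filter_query_func."""
    hits = []
    for doc_id, text in enumerate(docs):
        tokens = text.lower().split()
        if len(text) < 5:                              # must: num_characters >= 5
            continue
        if any(t.startswith("bla") for t in tokens):   # must_not: prefix "bla"
            continue
        should_hit = query_term in tokens              # should: optional match
        hits.append((doc_id, text, should_hit))
    # Documents matching the should clause rank first; sort is stable otherwise.
    hits.sort(key=lambda h: not h[2])
    return [(doc_id, text) for doc_id, text, _ in hits]


hits = bool_filter(texts, "foo")
print([doc_id for doc_id, _ in hits])  # [5, 2, 3, 4]
```

This is consistent with the scores listed above: every hit gets 1.0 from the range clause, and 'foo bar' gets an additional 0.7437 from the should match.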

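6) Document mapping

The ES hits are mapped to langchain Document objects with a custom function. The query function is still the complex filter filter_query_func; the mapping function num_characters_mapper is shown below and can be customized as needed.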

def num_characters_mapper(hit: Dict[str, Any]) -> Document:
    num_chars = hit["_source"][num_characters_field]
    content = hit["_source"][text_field]
    return Document(
        page_content=f"This document has {num_chars} characters",
        metadata={"text_content": content},
    )


custom_mapped_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    document_mapper=num_characters_mapper,
    es_client=es_client,  # reuse the authenticated client; a bare url carries no credentials or CA certificate
)

custom_mapped_retriever.invoke("foo")

Sample output is shown below

Document(metadata={'text_content': 'foo bar'}, page_content='This document has 7 characters'), Document(metadata={'text_content': 'world'}, page_content='This document has 5 characters'), Document(metadata={'text_content': 'hello world'}, page_content='This document has 11 characters'), Document(metadata={'text_content': 'hello'}, page_content='This document has 5 characters')

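3.4 langchain

Building on the ElasticsearchRetriever verified above, combined with the LLM configured through ChatOpenAI, we build a complete langchain RAG system. The chain is defined as follows

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The complete code is shown below.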

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)

llm = ChatOpenAI(model="deepseek-r1")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

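Then invoke the chain to run an end-to-end RAG knowledge-base query.

chain.invoke("what is foo?")

The output is shown below. The chain not only retrieves documents; the LLM also processes the retrieved results into an answer.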

'Based on the provided context, "foo" appears in two instances:\n1. In the line: "bla bla foo"\n2. In the line: "foo bar"\n\nThe context does not explicitly define what "foo" is, but it is used as part of example text alongside other placeholder terms like "bla," "bar," "hello," and "world." No further explanation or meaning is given for "foo" in the context.'
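To make the data flow of the chain explicit, the sketch below performs the same steps in plain Python: retrieve, join the contents, fill the prompt, and call the model. stub_retrieve and stub_llm are hypothetical stand-ins (the stub model simply echoes the prompt), so the block runs without ES or an LLM endpoint; it illustrates the flow and is not how langchain implements it.

```python
from typing import Callable, List

PROMPT_TEMPLATE = """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""


def format_docs(contents: List[str]) -> str:
    """Join retrieved passages with blank lines, like format_docs above."""
    return "\n\n".join(contents)


def run_chain(question: str,
              retrieve: Callable[[str], List[str]],
              llm: Callable[[str], str]) -> str:
    context = format_docs(retrieve(question))   # "context" branch of the chain
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return llm(prompt)                          # model call; output returned as a string


def stub_retrieve(question: str) -> List[str]:
    # Hypothetical retriever: returns fixed passages instead of querying ES.
    return ["bla bla foo", "foo bar"]


def stub_llm(prompt: str) -> str:
    # Hypothetical model: echoes the prompt so the filled template is visible.
    return prompt


answer = run_chain("what is foo?", stub_retrieve, stub_llm)
print(answer)
```

The question is passed through unchanged to the prompt while the same question also drives retrieval, which is the role of RunnablePassthrough in the chain definition.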

References

ElasticsearchRetriever

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

Installing Kibana + ElasticSearch locally on Mac with docker

https://blog.csdn.net/liliang199/article/details/151581138

Accessing a docker-based elasticsearch from python

https://blog.csdn.net/liliang199/article/details/151586083

ElasticsearchRetriever constructor parameters

https://python.langchain.com/api_reference/elasticsearch/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html

Reciprocal rank fusion

https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion

OneAPI: accessing all LLMs through the OpenAI API

https://blog.csdn.net/liliang199/article/details/151393128

ChatOpenAI

https://python.langchain.com/api_reference/community/chat_models/langchain_community.chat_models.openai.ChatOpenAI.html#langchain_community.chat_models.openai.ChatOpenAI
