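How to Build a RAG System with ElasticsearchRetriever

Elasticsearch is known for fast full-text retrieval, which makes it a popular choice for building document-based knowledge bases.

This article builds a RAG knowledge base system on top of ElasticsearchRetriever and langchain.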

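1 elasticsearch

1.1 elasticsearch

Elasticsearch is a distributed, RESTful search and analytics engine. It provides a multi-tenant-capable distributed full-text search engine with an HTTP interface and schema-free JSON document storage, and it supports keyword search, vector search, hybrid search, and complex filtering.

Here Elasticsearch is installed with docker, using the commands below.

Pull the ES image and create a docker network for ES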

docker pull docker.elastic.co/elasticsearch/elasticsearch-wolfi:9.1.3

docker network create elastic

docker pull docker.elastic.co/elasticsearch/elasticsearch:9.1.3

Start ES

docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB docker.elastic.co/elasticsearch/elasticsearch:9.1.3

For the detailed procedure, see

https://blog.csdn.net/liliang199/article/details/151581138

1.2 ElasticsearchRetriever

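ElasticsearchRetriever is a generic wrapper in langchain that gives flexible access to all Elasticsearch features through the Query DSL. For most use cases the other classes (such as ElasticsearchStore and ElasticsearchEmbeddings) are sufficient; ElasticsearchRetriever is the one to reach for when they are not.

For a detailed description of all ElasticsearchRetriever features and configuration options, see the documentation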

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

2 langchain

2.1 langchain

Elasticsearch is integrated through langchain here, so a langchain environment is required. Install it as follows.

pip install -qU langchain-community langchain-elasticsearch -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install -qU langchain-openai -i https://pypi.tuna.tsinghua.edu.cn/simple

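2.2 LLM configuration

Following the usual langchain convention, this article uses the deepseek-r1 model through an OpenAI-compatible interface; specifically, Deepseek R1 is called through a OneAPI relay.

For the setup procedure, see

https://blog.csdn.net/liliang199/article/details/151393128

Sample code is shown below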

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx" # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url" # LLM endpoint URL

...

llm = ChatOpenAI(model="deepseek-r1")

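3 Testing and verification

This section first verifies the ES connection, data import, and queries, then integrates ES into langchain through ElasticsearchRetriever to build a working RAG system.

3.1 ES connection

es_url is the address of the ES deployment, and passwd is the password of the elastic user in ES.

ssl_context holds the certificate that ES serves over HTTPS; for how to obtain it, see

https://blog.csdn.net/liliang199/article/details/151586083

Because an OpenAI-compatible LLM is used, the API token (api_key) and the model endpoint (base_url) also need to be configured.

Sample code is shown below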

import os
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxx" # LLM API token
os.environ['OPENAI_BASE_URL'] = "http://llm_model_provider_url" # LLM endpoint URL

import ssl
from typing import Any, Dict, Iterable

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from langchain_community.embeddings import DeterministicFakeEmbedding
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_elasticsearch import ElasticsearchRetriever

ssl_context = ssl.create_default_context(cafile='./http_ca.crt') # ES CA certificate
es_url = "https://localhost:9200" # ES address
passwd = "the passwd of the user elastic" # password of the elastic user

es_client = Elasticsearch(hosts=[es_url],  basic_auth=('elastic', passwd), ssl_context=ssl_context)
es_client.info()

The output below shows that the ES connection succeeded.

ObjectApiResponse({'name': 'xxxx', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'xxxxxx', ..., 'tagline': 'You Know, for Search'})

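3.2 Data import

Seven made-up test strings are used here: texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]

A vector index is created for these seven strings. The vectors are pseudo-random, produced by DeterministicFakeEmbedding; in a real environment this can be replaced with a real embedding model such as OllamaEmbeddings.

The implementation is shown below; it follows the usual pattern of embedding texts in batch and bulk-importing them into ES.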

embeddings = DeterministicFakeEmbedding(size=3)
index_name = "test-langchain-retriever_v0"
text_field = "text"
dense_vector_field = "fake_embedding"
num_characters_field = "num_characters"
texts = [
    "foo",
    "bar",
    "world",
    "hello world",
    "hello",
    "foo bar",
    "bla bla foo",
]

def create_index(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    num_characters_field: str,
):
    es_client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                text_field: {"type": "text"},
                dense_vector_field: {"type": "dense_vector"},
                num_characters_field: {"type": "integer"},
            }
        },
    )


def index_data(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    embeddings: Embeddings,
    texts: Iterable[str],
    refresh: bool = True,
) -> int:
    create_index(
        es_client, index_name, text_field, dense_vector_field, num_characters_field
    )

    vectors = embeddings.embed_documents(list(texts))
    requests = [
        {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,
            text_field: text,
            dense_vector_field: vector,
            num_characters_field: len(text),
        }
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]

    bulk(es_client, requests)

    if refresh:
        es_client.indices.refresh(index=index_name)

    return len(requests)


index_data(es_client, index_name, text_field, dense_vector_field, embeddings, texts)

Under normal circumstances, the output is as follows.

7
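To make the idea of a deterministic fake embedding concrete, here is a minimal standard-library sketch. It is not langchain's implementation and its values will not match the vectors stored in ES; the helper name fake_embed is hypothetical. It only illustrates the property the tests rely on: the same text always maps to the same pseudo-random vector, so the query "foo" lands on the indexed document "foo".

```python
import hashlib
import random
from typing import List


def fake_embed(text: str, size: int = 3) -> List[float]:
    """Hypothetical stand-in: seed a PRNG from the text, then draw Gaussian values."""
    # Hash the text so the seed is stable across processes and machines.
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % 10**8
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


first = fake_embed("foo")
second = fake_embed("foo")
other = fake_embed("bar")

print(first == second)  # same text, same vector
print(first == other)   # different text, different vector
```

Such vectors carry no semantic meaning, so vector search over them is only useful for checking that the pipeline is wired correctly.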

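3.3 Data queries

ES supports several query styles, such as vector queries, BM25 queries, and hybrid queries. Each style needs its own query function, which is passed to a corresponding ElasticsearchRetriever.

1) Vector query

The vector query code is shown below. It defines the query function vector_query and builds vector_retriever from it.

Inside the query function, the question search_query is first embedded, and the resulting vector is passed as query_vector.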

def vector_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "knn": {
            "field": dense_vector_field,
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10,
        }
    }


vector_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=vector_query,
    content_field=text_field,
    es_client=es_client
)

vector_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9987202, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), ...
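What the knn clause computes can be sketched as an exact, brute-force nearest-neighbor search in plain Python. This is an illustration under assumptions: it assumes the index uses cosine similarity, for which Elasticsearch documents the score as (1 + cosine) / 2, and it uses only the six vectors that appear (rounded) in the sample outputs of this article. Elasticsearch itself runs an approximate search over num_candidates per shard, so its scores may differ slightly from exact values.

```python
import math
from typing import Dict, List, Tuple

# Vectors copied (rounded) from the sample outputs in this article.
vectors: Dict[str, List[float]] = {
    "foo": [-2.3368, 0.2751, -0.7958],
    "world": [-0.7041, -1.4653, -0.2579],
    "hello world": [0.4273, -1.1890, -1.4454],
    "hello": [-0.2856, 0.9959, 1.5490],
    "foo bar": [0.2534, 0.0810, 0.7764],
    "bla bla foo": [1.7366, -0.5230, 0.7978],
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def brute_force_knn(query_vector: List[float], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector and keep the k best (exact, not approximate)."""
    scored = [
        (text, (1.0 + cosine(query_vector, vec)) / 2.0)  # assumed cosine-based score
        for text, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


top = brute_force_knn(vectors["foo"], k=5)
for text, score in top:
    print(f"{score:.4f}  {text}")
```

Because the fake embedding of the query "foo" equals the stored vector of the document "foo", that document comes first with a score of essentially 1.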

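2) BM25 query

A BM25 query is classic lexical (term-matching) search: the match query analyzes the query text and ranks the matching documents with BM25. The code is shown below.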

def bm25_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: search_query,
            },
        },
    }


bm25_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=bm25_query,
    content_field=text_field,
    es_client=es_client
)

bm25_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
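The BM25 scores above can be reproduced with a short plain-Python sketch. It assumes the common defaults k1 = 1.2 and b = 0.75, whitespace tokenization, and a single shard holding all seven documents; the (k1 + 1) factor is included because, with it, the sketch agrees with the scores shown above to about four decimal places.

```python
import math
from typing import Dict, List

texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]

K1 = 1.2   # term-frequency saturation (assumed default)
B = 0.75   # document-length normalization (assumed default)

docs: List[List[str]] = [t.split() for t in texts]
avgdl = sum(len(d) for d in docs) / len(docs)  # average document length in tokens


def bm25_score(term: str, doc: List[str]) -> float:
    """BM25 score of a single query term against one tokenized document."""
    n = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))
    tf = doc.count(term)
    tf_norm = tf / (tf + K1 * (1 - B + B * len(doc) / avgdl))
    return idf * (K1 + 1) * tf_norm


scores: Dict[str, float] = {t: bm25_score("foo", d) for t, d in zip(texts, docs)}
for text, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    if score > 0:
        print(f"{score:.4f}  {text}")
```

The ordering follows from length normalization: all three matching documents contain "foo" once, so the shortest document scores highest.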

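3) Hybrid query

A hybrid query defines both a standard text-matching query and a knn vector query in the same query function, and merges the two result lists with Reciprocal Rank Fusion (RRF).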

def hybrid_query(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {
                        "standard": {
                            "query": {
                                "match": {
                                    text_field: search_query,
                                }
                            }
                        }
                    },
                    {
                        "knn": {
                            "field": dense_vector_field,
                            "query_vector": vector,
                            "k": 5,
                            "num_candidates": 10,
                        }
                    },
                ]
            }
        }
    }


hybrid_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=hybrid_query,
    content_field=text_field,
    es_client=es_client
)

hybrid_retriever.invoke("foo")

Sample output is shown below. Note that it is identical to the BM25 output above, scores included; results fused by RRF carry rank-based scores instead, so treat this output as illustrative only.

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.9711467, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.6025789, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
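How RRF merges two ranked lists can be sketched in a few lines of plain Python. Each document receives 1 / (rank_constant + rank) from every list it appears in, and the contributions are summed. The value 60 is the documented default for rank_constant in Elasticsearch; the kNN ranking below is hypothetical and only serves the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rrf(rankings: List[List[str]], rank_constant: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_ranking = ["0", "5", "6"]           # ids in the order of the BM25 output above
knn_ranking = ["0", "3", "5", "2", "4"]  # hypothetical kNN order, for illustration only

fused = rrf([bm25_ranking, knn_ranking])
for doc_id, score in fused:
    print(f"{score:.5f}  id={doc_id}")
```

Documents ranked well in both lists (ids 0 and 5 here) rise to the top, and the fused scores are small rank-based numbers rather than BM25 or similarity values.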

4) Fuzzy query

Sample code is shown below; this is text matching with typo tolerance.

def fuzzy_query(search_query: str) -> Dict:
    return {
        "query": {
            "match": {
                text_field: {
                    "query": search_query,
                    "fuzziness": "AUTO",
                }
            },
        },
    }


fuzzy_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=fuzzy_query,
    content_field=text_field,
    es_client=es_client
)

fuzzy_retriever.invoke("fox")  # note the character tolerance

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '0', '_score': 0.6474311, '_source': {'fake_embedding': [-2.336764233933763, 0.27510289545940503, -0.7957597268194339], 'num_characters': 3}}, page_content='foo'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 0.49580228, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '6', '_score': 0.40171927, '_source': {'fake_embedding': [1.7365927060137358, -0.5230400847844948, 0.7978339724186192], 'num_characters': 11}}, page_content='bla bla foo')
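Why "fox" still finds "foo" can be illustrated with a small edit-distance sketch. It uses plain Levenshtein distance, whereas Elasticsearch by default also counts a transposition of two adjacent characters as a single edit, so this is a simplification. The AUTO thresholds follow the documented default: terms of 1 to 2 characters must match exactly, 3 to 5 characters allow one edit, and longer terms allow two.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def auto_fuzziness(term: str) -> int:
    """Maximum edits allowed by fuzziness AUTO with its default thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


query = "fox"
allowed = auto_fuzziness(query)
index_terms: List[str] = ["foo", "bar", "world", "hello", "bla"]
matches = [t for t in index_terms if edit_distance(query, t) <= allowed]
print(allowed, matches)
```

Only the term "foo" is within one edit of "fox", which is why the three documents containing "foo" are returned.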

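5) Complex filtering

Several clause types, such as must, must_not, and should, can be combined in a bool query to narrow the result set and influence ranking.

Sample code is shown below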

def filter_query_func(search_query: str) -> Dict:
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {num_characters_field: {"gte": 5}}},
                ],
                "must_not": [
                    {"prefix": {text_field: "bla"}},
                ],
                "should": [
                    {"match": {text_field: search_query}},
                ],
            }
        }
    }


filtering_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    content_field=text_field,
    es_client=es_client
)

filtering_retriever.invoke("foo")

The output is as follows

Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '5', '_score': 1.7437035, '_source': {'fake_embedding': [0.2533670476638539, 0.08100381646160418, 0.7763644080870179], 'num_characters': 7}}, page_content='foo bar'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '2', '_score': 1.0, '_source': {'fake_embedding': [-0.7041151202179595, -1.4652961969276497, -0.25786766898672847], 'num_characters': 5}}, page_content='world'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '3', '_score': 1.0, '_source': {'fake_embedding': [0.42728413221815387, -1.1889908285425348, -1.445433230084671], 'num_characters': 11}}, page_content='hello world'), Document(metadata={'_index': 'test-langchain-retriever_v0', '_id': '4', '_score': 1.0, '_source': {'fake_embedding': [-0.28560441330564046, 0.9958894823084921, 1.5489829880195058], 'num_characters': 5}}, page_content='hello')
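The effect of the three clauses can be emulated locally in plain Python. This is a rough emulation, not how Elasticsearch evaluates the query: it assumes whitespace tokenization, treats the prefix clause as matching any token that starts with "bla", and treats the should clause as a simple boost that moves matching documents to the front.

```python
from typing import List, Tuple

texts = ["foo", "bar", "world", "hello world", "hello", "foo bar", "bla bla foo"]


def bool_filter(search_query: str) -> List[str]:
    """Emulate must / must_not / should over the seven test strings."""
    kept: List[Tuple[str, bool]] = []
    for text in texts:
        tokens = text.split()
        if len(text) < 5:                               # must: num_characters >= 5
            continue
        if any(tok.startswith("bla") for tok in tokens):  # must_not: prefix "bla"
            continue
        should_hit = search_query in tokens             # should: optional, boosts the rank
        kept.append((text, should_hit))
    # Stable sort: documents matching the should clause come first,
    # the rest keep their original (index) order.
    kept.sort(key=lambda item: item[1], reverse=True)
    return [text for text, _ in kept]


result = bool_filter("foo")
print(result)
```

The result has the same four documents in the same order as the output above. The scores there appear consistent with this reading: 1.0 for documents that satisfy only the range clause, and 1.7437035 for 'foo bar', which adds its BM25 score for "foo".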

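6) Document mapping

A custom mapper controls how each ES hit is converted into a langchain Document. The query function is still the complex-filtering filter_query_func; the mapping function num_characters_mapper below is an example and can be customized as needed.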

def num_characters_mapper(hit: Dict[str, Any]) -> Document:
    num_chars = hit["_source"][num_characters_field]
    content = hit["_source"][text_field]
    return Document(
        page_content=f"This document has {num_chars} characters",
        metadata={"text_content": content},
    )


custom_mapped_retriever = ElasticsearchRetriever(
    index_name=index_name,
    body_func=filter_query_func,
    document_mapper=num_characters_mapper,
    es_client=es_client,  # reuse the authenticated client; a bare url carries no credentials or CA certificate
)

custom_mapped_retriever.invoke("foo")

Sample output is shown below

Document(metadata={'text_content': 'foo bar'}, page_content='This document has 7 characters'), Document(metadata={'text_content': 'world'}, page_content='This document has 5 characters'), Document(metadata={'text_content': 'hello world'}, page_content='This document has 11 characters'), Document(metadata={'text_content': 'hello'}, page_content='This document has 5 characters')

3.4 langchain

Building on the ElasticsearchRetriever verified above and a ChatOpenAI model, this section assembles a complete langchain RAG system. The chain is defined as follows

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The complete code is shown below

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""
)

llm = ChatOpenAI(model="deepseek-r1")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": vector_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Then invoke the chain to run a complete RAG knowledge base query.

chain.invoke("what is foo?")

The output is shown below. The chain not only retrieves documents but also has the LLM turn the retrieved results into an answer.

'Based on the provided context, "foo" appears in two instances:\n1. In the line: "bla bla foo"\n2. In the line: "foo bar"\n\nThe context does not explicitly define what "foo" is, but it is used as part of example text alongside other placeholder terms like "bla," "bar," "hello," and "world." No further explanation or meaning is given for "foo" in the context.'
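To see roughly what the model receives, the prompt assembly step can be reproduced without langchain or an LLM. In this sketch the Doc class is a hypothetical stand-in for langchain's Document, the retrieved list is made up, and plain str.format replaces ChatPromptTemplate; only format_docs and the template text are taken from the code above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document (only page_content is used)."""
    page_content: str


TEMPLATE = """Answer the question based only on the context provided.

Context: {context}

Question: {question}"""


def format_docs(docs: List[Doc]) -> str:
    # Same joining rule as in the chain: one blank line between documents.
    return "\n\n".join(doc.page_content for doc in docs)


# Made-up retrieval result, standing in for vector_retriever.invoke(...).
retrieved = [Doc("foo"), Doc("foo bar"), Doc("bla bla foo")]

prompt_text = TEMPLATE.format(context=format_docs(retrieved), question="what is foo?")
print(prompt_text)
```

Printing the assembled prompt like this is a quick way to check whether an unexpected answer comes from retrieval (wrong context) or from the model (right context, poor answer).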

References

ElasticsearchRetriever

https://python.langchain.com/docs/integrations/retrievers/elasticsearch_retriever/

Installing Kibana + ElasticSearch locally on Mac with docker

https://blog.csdn.net/liliang199/article/details/151581138

Accessing a docker-based elasticsearch from python

https://blog.csdn.net/liliang199/article/details/151586083

ElasticsearchRetriever constructor parameters

https://python.langchain.com/api_reference/elasticsearch/retrievers/langchain_elasticsearch.retrievers.ElasticsearchRetriever.html

Reciprocal rank fusion

https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion

OneAPI: accessing all LLMs through the OpenAI API

https://blog.csdn.net/liliang199/article/details/151393128

ChatOpenAI

https://python.langchain.com/api_reference/community/chat_models/langchain_community.chat_models.openai.ChatOpenAI.html#langchain_community.chat_models.openai.ChatOpenAI
