LangChain: Vector Storage and Retrieval Backed by Elasticsearch

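Candidate embedding models for Chinese text:

1. sentence-transformers/all-MiniLM-L6-v2: produces 384-dimensional vectors. Note that this model was trained mainly on English data, so check its quality on Chinese text before relying on it.

2. BAAI/bge-m3: a multilingual model that accepts inputs of up to 8192 tokens.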

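from langchain_community.embeddings import HuggingFaceEmbeddings

# all-MiniLM-L6-v2 is a sentence-transformers model, so it is loaded with
# HuggingFaceEmbeddings; HuggingFaceBgeEmbeddings is intended for BAAI/bge-* models.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model_kwargs = {"device": "cpu"}  # run inference on the CPU
encode_kwargs = {"normalize_embeddings": True}  # return unit-length vectors

embeddings = HuggingFaceEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)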
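With normalize_embeddings set to True, every vector has unit length, so the dot product of two embeddings equals their cosine similarity. The following is a minimal sketch of that property in plain Python. The helper names (l2_normalize, dot, cosine) and the toy vectors are assumptions made for illustration; they are not output of the model above.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 2-dimensional vectors, chosen only to keep the arithmetic easy to follow.
a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization the dot product matches the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b))
```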

1. Storing the vectors in Elasticsearch

from typing import Any, Dict, Iterable

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_elasticsearch import ElasticsearchRetriever

# Connect to Elasticsearch and confirm that the cluster is reachable.
es_url = "http://user:password@localhost:9200"
es_client = Elasticsearch(hosts=[es_url])
es_client.info()

# Index name and field names shared by the indexing and retrieval code.
index_name = "test-langchain-retriever"
text_field = "text"
dense_vector_field = "fake_embedding"
num_characters_field = "num_characters"

# Sample texts to embed and index.
texts = [
    "foo",
    "bar",
    "world",
    "hello world",
    "hello",
    "foo bar",
    "bla bla foo",
]

def create_index(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    num_characters_field: str,
):
    es_client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                text_field: {"type": "text"},
                dense_vector_field: {"type": "dense_vector"},
                num_characters_field: {"type": "integer"},
            }
        },
    )
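The mapping in create_index declares the vector field only as dense_vector and leaves everything else to Elasticsearch defaults. As a hedged sketch, the mapping could instead state the dimension and similarity explicitly, so the index matches the embedding model and approximate kNN search is clearly enabled. The function name build_vector_mappings is hypothetical, and dims=384 assumes the all-MiniLM-L6-v2 model mentioned earlier; a different model needs a different value.

```python
from typing import Any, Dict


def build_vector_mappings(
    text_field: str,
    dense_vector_field: str,
    num_characters_field: str,
    dims: int,
    similarity: str = "cosine",
) -> Dict[str, Any]:
    """Build index mappings with an explicitly configured dense_vector field."""
    return {
        "properties": {
            text_field: {"type": "text"},
            dense_vector_field: {
                "type": "dense_vector",
                "dims": dims,  # must equal the embedding model's output size
                "index": True,  # index the vectors for approximate kNN search
                "similarity": similarity,  # metric used by kNN search
            },
            num_characters_field: {"type": "integer"},
        }
    }


# Assumption: 384 dimensions, as produced by all-MiniLM-L6-v2.
mappings = build_vector_mappings("text", "fake_embedding", "num_characters", dims=384)
```

The resulting dictionary could be passed as the mappings argument of es_client.indices.create in place of the inline mapping.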

def index_data(
    es_client: Elasticsearch,
    index_name: str,
    text_field: str,
    dense_vector_field: str,
    embeddings: Embeddings,
    texts: Iterable[str],
    refresh: bool = True,
) -> None:
    # num_characters_field is read from the module-level variable defined above.
    create_index(
        es_client, index_name, text_field, dense_vector_field, num_characters_field
    )

    # Materialize the iterable once so it can be both embedded and zipped.
    texts = list(texts)
    vectors = embeddings.embed_documents(texts)

    # One bulk "index" action per text, storing the text, its vector and its length.
    requests = [
        {
            "_op_type": "index",
            "_index": index_name,
            "_id": i,
            text_field: text,
            dense_vector_field: vector,
            num_characters_field: len(text),
        }
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]

    bulk(es_client, requests)

    # Refresh so the new documents are searchable immediately.
    if refresh:
        es_client.indices.refresh(index=index_name)


index_data(es_client, index_name, text_field, dense_vector_field, embeddings, texts)

2. Vector retrieval with Elasticsearch:

es_url = "http://user:password@localhost:9200"
index_name = "test-langchain-retriever"
text_field = "text"
dense_vector_field = "fake_embedding"
num_characters_field = "num_characters"


def gen_dsl(search_query: str) -> Dict:
    vector = embeddings.embed_query(search_query)  # same embeddings as for indexing
    return {
        "knn": {
            "field": dense_vector_field,
            "query_vector": vector,
            "k": 5,
            "num_candidates": 10,
        }
    }

vector_retriever = ElasticsearchRetriever.from_es_params(
    index_name=index_name,
    body_func=gen_dsl,  # the query-building function defined above
    content_field=text_field,
    url=es_url,
)

vector_retriever.invoke("foo")

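Note: this simple vector retrieval is fairly slow.

Likely causes:

1. Cosine similarity is computed across the whole index, with no optimization applied.

2. Each hit is returned with its full vector content.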
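The second cause can be addressed in the query body itself: Elasticsearch's _source filtering can exclude the vector field from each hit, so only the text and metadata travel back to the client. Below is a minimal sketch under stated assumptions. The function name gen_dsl_without_vectors is hypothetical, the query vector is passed in as an argument so the example runs without a model or a cluster, and the default num_candidates of 50 is an illustrative value, not a tuned one.

```python
from typing import Any, Dict, List


def gen_dsl_without_vectors(
    query_vector: List[float],
    vector_field: str,
    k: int = 5,
    num_candidates: int = 50,
) -> Dict[str, Any]:
    """Build a kNN search body that omits the stored vectors from the hits."""
    if num_candidates < k:
        # Elasticsearch requires num_candidates to be at least k.
        raise ValueError("num_candidates must be greater than or equal to k")
    return {
        "knn": {
            "field": vector_field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        # Keep the text and metadata in each hit, but drop the large vector.
        "_source": {"excludes": [vector_field]},
    }


# Toy 3-dimensional vector standing in for a real query embedding.
body = gen_dsl_without_vectors([0.1, 0.2, 0.3], "fake_embedding")
```

To use this with the retriever, body_func could wrap it, for example as lambda q: gen_dsl_without_vectors(embeddings.embed_query(q), dense_vector_field). The content field must stay in _source, since the retriever reads the document text from it.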
