Rerankers from a Milvus Perspective

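In information retrieval and generative AI, a reranker is an important tool for refining the order of initial search results. Unlike a traditional embedding model, a reranker takes a query and a document together as input and directly returns a similarity score rather than an embedding. That score indicates how relevant the document is to the query.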
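The difference in interface can be sketched with toy stand-ins. Everything below is an illustrative assumption: `toy_embed`, `toy_rerank_score`, and the tiny `VOCAB` are hypothetical helpers based on word overlap, not real models and not part of PyMilvus. The point is only the shape of the two calls: an embedding function sees one text at a time, while a reranker sees the query and the document together and returns a single score.

```python
import re
from typing import List, Set

# Hypothetical five-word vocabulary, used only by the toy embedding below.
VOCAB = ["dartmouth", "conference", "1956", "turing", "chess"]


def tokens(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def toy_embed(text: str) -> List[float]:
    """Embedding-style interface: one text in, one vector out."""
    present = tokens(text)
    return [1.0 if word in present else 0.0 for word in VOCAB]


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def toy_rerank_score(query: str, document: str) -> float:
    """Reranker-style interface: query and document in, one score out."""
    q, d = tokens(query), tokens(document)
    return len(q & d) / len(q) if q else 0.0


query = "Dartmouth conference 1956"
document = "The Dartmouth Conference in 1956 coined the term 'artificial intelligence'."

# Embedding route: encode each text separately, then compare the vectors.
embedding_similarity = dot(toy_embed(query), toy_embed(document))

# Reranker route: score the (query, document) pair directly.
pair_score = toy_rerank_score(query, document)
```

Because the embedding route encodes documents without seeing the query, document vectors can be computed once and indexed. The reranker route has to run once per query and document pair, which is why it is usually reserved for a short candidate list.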

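Rerankers are typically applied after first-stage retrieval, which is usually performed with vector approximate nearest neighbor (ANN) search. ANN search efficiently fetches a large set of potentially relevant results, but it does not always order them by how close they actually are to the query semantically. A reranker then refines the order through deeper contextual analysis, typically using BERT or another Transformer-based model. In this way, rerankers can significantly improve the accuracy and relevance of the final results presented to the user.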
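The two-stage flow described above can be sketched as follows. This is a minimal illustration under stated assumptions: `first_stage` stands in for ANN search and `toy_pair_score` stands in for a Transformer-based reranker; both are hypothetical word-overlap heuristics chosen so the example runs without any model or Milvus server.

```python
import re
from typing import Callable, List, Set, Tuple


def words(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def first_stage(query: str, corpus: List[str], k: int) -> List[int]:
    """Cheap candidate retrieval (stand-in for ANN search).

    Ranks documents by the number of words shared with the query
    and returns the indices of the top k.
    """
    q = words(query)
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: len(q & words(corpus[i])),
        reverse=True,
    )
    return ranked[:k]


def toy_pair_score(query: str, document: str) -> float:
    """Stand-in for a reranker: Jaccard overlap of the word sets."""
    q, d = words(query), words(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def second_stage(
    query: str,
    corpus: List[str],
    candidates: List[int],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Score each candidate against the query and keep the best top_k."""
    scored = [(i, score_fn(query, corpus[i])) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


corpus = [
    "Alan Turing proposed the Turing Test in 1950",
    "The Dartmouth Conference in 1956 coined the term artificial intelligence",
    "A chess program was written in 1951",
    "A conference on logic was held in 1956",
]
query = "dartmouth conference 1956"

candidates = first_stage(query, corpus, k=2)
reranked = second_stage(query, corpus, candidates, toy_pair_score, top_k=2)
```

The first stage only narrows the corpus to a short list; the second stage spends its more expensive scoring on that short list alone, which is the division of labor between ANN search and a reranker.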

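The PyMilvus model library integrates rerank functions for refining the order of results returned by an initial search. After retrieving the nearest embeddings from Milvus, you can apply these rerankers to reorder the results and improve their precision.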

| Rerank Function | API or Open-sourced |
| --- | --- |
| BGE | Open-sourced |
| Cross Encoder | Open-sourced |
| Voyage | API |
| Cohere | API |
| Jina AI | API |

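Example 1: Use the BGE rerank function to rerank documents according to a query

In this example, we demonstrate how to rerank search results against a specific query using the BGE reranker.

To use a reranker with the PyMilvus model library, first install PyMilvus together with the model subpackage, which contains all the necessary reranking utilities: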

```shell
pip install pymilvus[model]
# or pip install "pymilvus[model]" for zsh.
```

To use the BGE reranker, first import the `BGERerankFunction` class:

```python
from pymilvus.model.reranker import BGERerankFunction
```

Then create a `BGERerankFunction` instance for reranking:

```python
bge_rf = BGERerankFunction(
    model_name="BAAI/bge-reranker-v2-m3",  # Specify the model name. Defaults to `BAAI/bge-reranker-v2-m3`.
    device="cpu"  # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)
```

To rerank documents according to a query, use the following code:

```python
query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"

documents = [
    "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
    "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
    "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
    "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
]

bge_rf(query, documents)
```

The expected output is similar to the following:

```
[RerankResult(text="The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", score=0.9911615761470803, index=1),
 RerankResult(text="In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.", score=0.0326971950177779, index=0),
 RerankResult(text='The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.', score=0.006514905766152258, index=3),
 RerankResult(text='In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.', score=0.0042116724917325935, index=2)]
```
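Each result in the output above carries a `text`, a `score`, and an `index` that points back to the position of the document in the input list. The sketch below shows two common follow-up steps: dropping low-scoring results and recovering the original positions. The `RerankResult` dataclass here is a hypothetical stand-in that only mirrors the three fields shown in the output, so the example runs without PyMilvus; the shortened texts, rounded scores, and the 0.5 cutoff are illustrative assumptions as well.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RerankResult:
    """Stand-in with the same three fields as the results printed above."""
    text: str
    score: float
    index: int


def keep_relevant(results: List[RerankResult], min_score: float) -> List[RerankResult]:
    """Keep only the results whose score reaches the threshold."""
    return [r for r in results if r.score >= min_score]


def original_positions(results: List[RerankResult]) -> List[int]:
    """Return the input-list position of each result, in reranked order."""
    return [r.index for r in results]


# Sample data: shortened texts, scores rounded from the output above.
results = [
    RerankResult(text="Dartmouth Conference, 1956", score=0.9912, index=1),
    RerankResult(text="Turing paper, 1950", score=0.0327, index=0),
    RerankResult(text="Logic Theorist, 1955", score=0.0065, index=3),
    RerankResult(text="Turing chess program, 1951", score=0.0042, index=2),
]

relevant = keep_relevant(results, min_score=0.5)  # 0.5 is an arbitrary cutoff
positions = original_positions(results)
```

The `index` field is what lets you join reranked results back to any metadata you kept alongside the original document list, such as document IDs.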

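Example 2: Use a reranker to enhance the relevance of search results

In this guide, we explore how to run a similarity search with the `search()` method in PyMilvus and how to use a reranker to improve the relevance of the results. The demonstration uses the following dataset: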

```python
entities = [
    {'doc_id': 0, 'doc_vector': [-0.0372721,0.0101959,...,-0.114994], 'doc_text': "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence."},
    {'doc_id': 1, 'doc_vector': [-0.00308882,-0.0219905,...,-0.00795811], 'doc_text': "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals."},
    {'doc_id': 2, 'doc_vector': [0.00945078,0.00397605,...,-0.0286199], 'doc_text': 'In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.'},
    {'doc_id': 3, 'doc_vector': [-0.0391119,-0.00880096,...,-0.0109257], 'doc_text': 'The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.'}
]
```

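Dataset components

  • doc_id: the unique identifier of each document.
  • doc_vector: the vector embedding that represents the document. For guidance on generating embeddings, see Embeddings.
  • doc_text: the text content of the document.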
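Before inserting such a dataset, it can help to confirm that every entity carries these three fields, that the IDs are unique, and that all vectors share one dimension. The helper below is a hypothetical pre-flight check, not part of PyMilvus; the 3-dimensional sample vectors are made up to keep the example short.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("doc_id", "doc_vector", "doc_text")


def validate_entities(entities: List[Dict[str, Any]], dim: int) -> int:
    """Check field presence, vector dimension, and ID uniqueness.

    Raises ValueError on the first problem found; otherwise returns
    the number of entities checked.
    """
    seen_ids = set()
    for pos, entity in enumerate(entities):
        missing = [f for f in REQUIRED_FIELDS if f not in entity]
        if missing:
            raise ValueError(f"entity {pos} is missing fields: {missing}")
        if len(entity["doc_vector"]) != dim:
            raise ValueError(
                f"entity {pos} has dimension {len(entity['doc_vector'])}, expected {dim}"
            )
        if entity["doc_id"] in seen_ids:
            raise ValueError(f"duplicate doc_id: {entity['doc_id']}")
        seen_ids.add(entity["doc_id"])
    return len(entities)


# Made-up sample with tiny 3-dimensional vectors.
sample = [
    {"doc_id": 0, "doc_vector": [0.1, 0.2, 0.3], "doc_text": "first document"},
    {"doc_id": 1, "doc_vector": [0.4, 0.5, 0.6], "doc_text": "second document"},
]
checked = validate_entities(sample, dim=3)
```

In practice `dim` would be set to the dimension declared for the vector field in the collection schema.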

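Preparation

Before running a similarity search, you need to connect to Milvus, create a collection, and prepare and insert data into it. The following snippet demonstrates these preparation steps.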

```python
from pymilvus import MilvusClient, DataType

client = MilvusClient(
    uri="http://10.102.6.214:19530"  # replace with your own Milvus server address
)

client.drop_collection('test_collection')

# define schema

schema = client.create_schema(auto_id=False, enable_dynamic_field=True)

schema.add_field(field_name="doc_id", datatype=DataType.INT64, is_primary=True, description="document id")
schema.add_field(field_name="doc_vector", datatype=DataType.FLOAT_VECTOR, dim=384, description="document vector")
schema.add_field(field_name="doc_text", datatype=DataType.VARCHAR, max_length=65535, description="document text")

# define index params

index_params = client.prepare_index_params()

index_params.add_index(field_name="doc_vector", index_type="IVF_FLAT", metric_type="IP", params={"nlist": 128})

# create collection

client.create_collection(collection_name="test_collection", schema=schema, index_params=index_params)

# insert data into collection

client.insert(collection_name="test_collection", data=entities)

# Output:
# {'insert_count': 4, 'ids': [0, 1, 2, 3]}
```

Run a similarity search

After the data is inserted, run a similarity search with the `search` method.

```python
# search results based on our query

res = client.search(
    collection_name="test_collection",
    data=[[-0.045217834, 0.035171617, ..., -0.025117004]],  # replace with your query vector
    limit=3,
    output_fields=["doc_id", "doc_text"]
)

for i in res[0]:
    print(f'distance: {i["distance"]}')
    print(f'doc_text: {i["entity"]["doc_text"]}')
```

The expected output is similar to the following:

```
distance: 0.7235960960388184
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
distance: 0.6269873976707458
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
distance: 0.5340118408203125
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
```
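In a real pipeline, the documents handed to a reranker would normally come from these search hits rather than from a hard-coded list. The sketch below pulls `doc_text` out of hits shaped like the ones printed above and passes them to any callable that accepts `query`, `documents`, and `top_k` keyword arguments. `rerank_hits` and `stub_reranker` are hypothetical helpers; the stub scores by word overlap so the example runs without a model, and the sample hits and distances are made up.

```python
import re
from collections import namedtuple
from typing import Any, Callable, Dict, List

# Stub result type with the same field names as a rerank result.
StubResult = namedtuple("StubResult", ["text", "score", "index"])


def stub_reranker(query: str, documents: List[str], top_k: int) -> List[StubResult]:
    """Stand-in reranker: scores each document by words shared with the query."""
    q = set(re.findall(r"\w+", query.lower()))
    scored = [
        StubResult(doc, float(len(q & set(re.findall(r"\w+", doc.lower())))), i)
        for i, doc in enumerate(documents)
    ]
    return sorted(scored, key=lambda r: r.score, reverse=True)[:top_k]


def rerank_hits(
    query: str,
    hits: List[Dict[str, Any]],
    reranker: Callable[..., List[Any]],
    top_k: int = 3,
) -> List[Any]:
    """Extract doc_text from each hit and let the reranker reorder the texts."""
    documents = [hit["entity"]["doc_text"] for hit in hits]
    return reranker(query=query, documents=documents, top_k=top_k)


# Made-up hits in the same shape as the entries of res[0].
hits = [
    {"distance": 0.70, "entity": {"doc_id": 0, "doc_text": "Alan Turing published a paper in 1950"}},
    {"distance": 0.65, "entity": {"doc_id": 1, "doc_text": "The Dartmouth Conference in 1956"}},
]

reranked = rerank_hits(
    query="What happened at the Dartmouth Conference in 1956",
    hits=hits,
    reranker=stub_reranker,
    top_k=2,
)
```

Each result's `index` refers to the position within `hits`, so `hits[result.index]["entity"]["doc_id"]` recovers the stored document ID after reranking.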

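Use a reranker to enhance search results

Next, improve the relevance of the search results with a reranking step. In this example, we rerank the results with `CrossEncoderRerankFunction`, which is built into PyMilvus.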

```python
# use reranker to rerank search results

from pymilvus.model.reranker import CrossEncoderRerankFunction

ce_rf = CrossEncoderRerankFunction(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",  # Specify the model name.
    device="cpu"  # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)

reranked_results = ce_rf(
    query='What event in 1956 marked the official birth of artificial intelligence as a discipline?',
    documents=[
        "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
        "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
        "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
        "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
    ],
    top_k=3
)

# print the reranked results
for result in reranked_results:
    print(f'score: {result.score}')
    print(f'doc_text: {result.text}')
```

The expected output is similar to the following:

```
score: 6.250532627105713
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
score: -2.9546022415161133
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
score: -4.771512031555176
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.
```
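The scores above are not confined to the range 0 to 1, which suggests they are raw model outputs. If downstream code expects scores between 0 and 1, for example to apply a fixed cutoff, one common option is to pass them through the logistic sigmoid. This is an optional post-processing sketch, not something the example above does; the function name is an assumption. Because the sigmoid is strictly increasing, the ranking stays the same.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Raw scores copied from the output above.
raw_scores: List[float] = [6.250532627105713, -2.9546022415161133, -4.771512031555176]

# Map each raw score into the open interval (0, 1).
normalized = [sigmoid(s) for s in raw_scores]
```

After this step the top result lands close to 1 and the other two close to 0, which makes the gap between them easier to compare against a threshold.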