Milvus 视角看重排序模型(Rerankers)

在信息检索和生成式人工智能领域,重排序器是优化初始搜索结果顺序的重要工具。重排序器与传统的嵌入模型不同,它将查询和文档作为输入,并直接返回相似度得分,而不是嵌入。该得分表示输入查询和文档之间的相关性。

重排序器通常在第一阶段检索之后使用,通常通过向量近似最近邻 (ANN) 技术完成。虽然 ANN 搜索能够高效地获取大量潜在相关的结果,但它们并不总是根据结果与查询的实际语义接近程度来确定优先级。此时,重排序器会通过更深入的上下文分析来优化结果顺序,通常会利用 BERT 或其他基于 Transformer 的高级机器学习模型。通过这种方式,重排序器可以显著提高呈现给用户的最终结果的准确性和相关性。

PyMilvus 模型库集成了重排序功能,用于优化初始搜索返回结果的排序。从 Milvus 检索到最近的嵌入后,您可以利用这些重排序工具来优化搜索结果,从而提高搜索结果的准确率。

Rerank Function API or Open-sourced
BGE Open-sourced
Cross Encoder Open-sourced
Voyage API
Cohere API
Jina AI API

示例 1:使用 BGE rerank 函数根据查询对文档进行重新排序

在此示例中,我们演示了如何使用基于特定查询的BGE 重新排序器对搜索结果进行重新排序。

要将重新排序器与PyMilvus 模型库一起使用,首先安装 PyMilvus 模型库以及包含所有必要的重新排序实用程序的模型子包:

复制代码
`pip install pymilvus[model]
`# or pip install "pymilvus[model]" for zsh.

要使用 BGE 重新排序器,首先导入BGERerankFunction类:

复制代码
`from pymilvus.model.reranker import BGERerankFunction`

然后,创建一个BGERerankFunction重新排名的实例:

复制代码
`bge_rf = BGERerankFunction(
    model_name="BAAI/bge-reranker-v2-m3",  `# Specify the model name. Defaults to `BAAI/bge-reranker-v2-m3`.`
    device="cpu" `# Specify the device to use, e.g., 'cpu' or 'cuda:0'`
)`

要根据查询对文档重新排序,请使用以下代码:

复制代码
`query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"

documents = [
    "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
    "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
    "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
    "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
]

bge_rf(query, documents)`

预期输出类似于以下内容:

复制代码
`[RerankResult(text="The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", score=0.9911615761470803, index=1),
 RerankResult(text="In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.", score=0.0326971950177779, index=0),
 RerankResult(text='The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.', score=0.006514905766152258, index=3),
 RerankResult(text='In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.', score=0.0042116724917325935, index=2)]`

示例 2:使用重新排序器增强搜索结果的相关性

在本指南中,我们将探索如何利用search()PyMilvus 中的方法进行相似性搜索,以及如何使用重排序器增强搜索结果的相关性。我们的演示将使用以下数据集:

复制代码
`entities = [
    {'doc_id': 0, 'doc_vector': [-0.0372721,0.0101959,...,-0.114994], 'doc_text': "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence."}, 
    {'doc_id': 1, 'doc_vector': [-0.00308882,-0.0219905,...,-0.00795811], 'doc_text': "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals."}, 
    {'doc_id': 2, 'doc_vector': [0.00945078,0.00397605,...,-0.0286199], 'doc_text': 'In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.'}, 
    {'doc_id': 3, 'doc_vector': [-0.0391119,-0.00880096,...,-0.0109257], 'doc_text': 'The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.'}
]`

数据集组件

  • doc_id:每个文档的唯一标识符。
  • doc_vector:表示文档的向量嵌入。有关生成嵌入的指导,请参阅嵌入
  • doc_text:文档的文本内容。

准备工作

在启动相似性搜索之前,您需要与 Milvus 建立连接,创建一个集合,并准备数据并将其插入到该集合中。以下代码片段演示了这些准备步骤。

复制代码
`from pymilvus import MilvusClient, DataType

client = MilvusClient(
    uri="http://10.102.6.214:19530" `# replace with your own Milvus server address`
)

client.drop_collection('test_collection')

`# define schema`

schema = client.create_schema(auto_id=False, enabel_dynamic_field=True)

schema.add_field(field_name="doc_id", datatype=DataType.INT64, is_primary=True, description="document id")
schema.add_field(field_name="doc_vector", datatype=DataType.FLOAT_VECTOR, dim=384, description="document vector")
schema.add_field(field_name="doc_text", datatype=DataType.VARCHAR, max_length=65535, description="document text")

`# define index params`

index_params = client.prepare_index_params()

index_params.add_index(field_name="doc_vector", index_type="IVF_FLAT", metric_type="IP", params={"nlist": 128})

`# create collection`

client.create_collection(collection_name="test_collection", schema=schema, index_params=index_params)

`# insert data into collection`

client.insert(collection_name="test_collection", data=entities)

`# Output:
# {'insert_count': 4, 'ids': [0, 1, 2, 3]}

进行相似性搜索

数据插入后,使用该方法进行相似性搜索search

复制代码
# search results based on our query`

res = client.search(
    collection_name="test_collection",
    data=[[-0.045217834, 0.035171617, ..., -0.025117004]], `# replace with your query vector`
    limit=3,
    output_fields=["doc_id", "doc_text"]
)

for i in res[0]:
    print(f'distance: {i["distance"]}')
    print(f'doc_text: {i["entity"]["doc_text"]}')`

预期输出类似于以下内容:

复制代码
`distance: 0.7235960960388184
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
distance: 0.6269873976707458
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
distance: 0.5340118408203125
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.`

使用重新排序器来增强搜索结果

然后,通过重新排序步骤来提高搜索结果的相关性。在本例中,我们使用CrossEncoderRerankFunction内置的 PyMilvus 对结果进行重新排序,以提高准确率。

复制代码
# use reranker to rerank search results`

from pymilvus.model.reranker import CrossEncoderRerankFunction

ce_rf = CrossEncoderRerankFunction(
    model_name="cross-encoder/ms-marco-MiniLM-L-6-v2",  `# Specify the model name.`
    device="cpu" `# Specify the device to use, e.g., 'cpu' or 'cuda:0'`
)

reranked_results = ce_rf(
    query='What event in 1956 marked the official birth of artificial intelligence as a discipline?',
    documents=[
        "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
        "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
        "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
        "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
    ],
    top_k=3
)

`# print the reranked results`
for result in reranked_results:
    print(f'score: {result.score}')
    print(f'doc_text: {result.text}')`

预期输出类似于以下内容:

复制代码
`score: 6.250532627105713
doc_text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.
score: -2.9546022415161133
doc_text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.
score: -4.771512031555176
doc_text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.`
相关推荐
dblens 数据库管理和开发工具1 天前
开源向量数据库比较:Chroma, Milvus, Faiss,Weaviate
数据库·开源·milvus·faiss·chroma·weaviate
玄同7652 天前
数据库全解析:从关系型到向量数据库,LLM 开发中的选型指南
数据库·人工智能·知识图谱·milvus·知识库·向量数据库·rag
自可乐3 天前
Milvus向量数据库/RAG基础设施学习教程
数据库·人工智能·python·milvus
领航猿1号8 天前
Langchain 1.0.2 从入门到精通(含基础、RAG、Milvus、Ollama、MCP、Agents)
langchain·agent·milvus·rag·mcp·langchain 1.0
Knight_AL9 天前
Docker 部署 Milvus 并连接现有 MinIO 对象存储
docker·eureka·milvus
码农阿豪9 天前
基于Milvus与混合检索的云厂商文档智能问答系统:Java SpringBoot全栈实现
java·spring boot·milvus
GeminiJM9 天前
亿级向量检索:Elasticsearch vs. Milvus,性能鸿沟与架构抉择
elasticsearch·架构·milvus
福大大架构师每日一题11 天前
milvus v2.6.9 发布:支持主键搜索、段重开机制、日志性能全面提升!
android·java·milvus
Lkygo14 天前
milvus快速入门(包含图片搜索)
milvus
molaifeng14 天前
告别大模型幻觉:深度解析 RAG 文档切割艺术与 Milvus 高性能实战
milvus·rag