Unable to get expected results using BM25 or any search functions in Weaviate

Problem background:

I created a collection in Weaviate and ingested some documents into it using LlamaIndex. With the default search, it kept retrieving the wrong documents. I then tested BM25 search, and it gave higher scores to other documents, even though I had copied the entire phrase from the expected document.

Server Setup Information

  • Weaviate Server Version: 1.24.10
  • Deployment Method: Docker
  • LlamaIndex Version: 0.10.42

Document Preparation

Document of interest: an article downloaded as a PDF from https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1627185 and stored locally. I have 20 other documents that are ingested together with it for retrieval testing.

Python Setup Information

Imports

# Weaviate
import weaviate
from weaviate.classes.config import Configure, VectorDistances, Property, DataType
from weaviate.util import generate_uuid5
from weaviate.classes.query import MetadataQuery

# LlamaIndex
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core.node_parser import SentenceSplitter

Creating an Index with Weaviate Database

# Creating a Weaviate collection
def create_collection(client, collection_name):
  client.collections.create(
    collection_name,
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
    vector_index_config=Configure.VectorIndex.hnsw(distance_metric=VectorDistances.COSINE),
    reranker_config=Configure.Reranker.transformers(),
    inverted_index_config=Configure.inverted_index(
      bm25_b=0.7,
      bm25_k1=1.25,
      index_null_state=True,
      index_property_length=True,
      index_timestamps=True
    ),
  )
 
# Create index using LlamaIndex
def create_weaviate_index(client, index_name, doc_folder):
  create_collection(client, index_name)
  vector_store = WeaviateVectorStore(weaviate_client=client, index_name=index_name, text_key="content")
  storage_context = StorageContext.from_defaults(vector_store=vector_store)
  index = VectorStoreIndex.from_documents([], storage_context=storage_context)
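  documents = SimpleDirectoryReader(input_dir=doc_folder).load_data()  # load_data() returns the parsed documents
  node_parser = SentenceSplitter()  # the parser was not defined in the snippet; default settings are assumed here
  nodes = node_parser.get_nodes_from_documents(documents)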
  index.insert_nodes(nodes)
  return index

client = weaviate.connect_to_local()
index_name = "LlamaIndex"
doc_folder = "/path/to/doc_folder"
index = create_weaviate_index(client, index_name, doc_folder)  # keep the index for the queries below

Querying with documents

Using LlamaIndex

query_engine = index.as_query_engine()
question = "EMA was created in 2001 to?" # Took partial string from document
response = query_engine.query(question)
print(response)

for node in response.source_nodes:
  print(node.metadata) # Did not retrieve the document that I copied the string from

Using Weaviate hybrid search, alpha set to 0

collection = client.collections.get("LlamaIndex")
question = "EMA was created in 2001 to?" # Took partial string 
query_vector = embed_model.get_query_embedding(question)

response = collection.query.hybrid(
  query=question,
  vector=query_vector,
  limit=5,
  alpha=0,

  return_metadata=MetadataQuery(
    distance=True,
    certainty=True,
    score=True,
    explain_score=True
  )
)

for obj in response.objects:
  print(f"METADATA: {obj.metadata}") # Did not retrieve the document that I copied the string from
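
For reference, alpha controls how a hybrid query weights its two result sets: 0 is a pure keyword (BM25) search and 1 is a pure vector search. The sketch below is a simplified illustration only, assuming min-max normalisation followed by a weighted sum; it is not Weaviate's actual fusion code. The function names and sample scores are hypothetical.

```python
def min_max(scores):
    """Scale scores to [0, 1]; an empty or constant input maps to zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(keyword_scores, vector_scores, alpha):
    """Rank ids by alpha * vector score + (1 - alpha) * keyword score."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical scores for three chunks (higher is better in both dicts)
keyword_scores = {"doc_a": 4.2, "doc_b": 1.1, "doc_c": 0.3}
vector_scores = {"doc_a": 0.55, "doc_b": 0.91, "doc_c": 0.60}

ranking_keyword_only = fuse(keyword_scores, vector_scores, alpha=0)
ranking_vector_only = fuse(keyword_scores, vector_scores, alpha=1)
print(ranking_keyword_only)
print(ranking_vector_only)
```

With alpha=0 the vector scores have no influence on the ranking, so the hybrid query above would be expected to behave like the plain BM25 query.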

Using Weaviate bm25 search

collection = client.collections.get("LlamaIndex")
question = "EMA was created in 2001 to?" # Took partial string 
response = collection.query.bm25(
  query=question,
  limit=5,

  return_metadata=MetadataQuery(
    distance=True,
    certainty=True,
    score=True,
    explain_score=True
  )
)

for obj in response.objects:
  print(f"METADATA: {obj.metadata}") # Did not retrieve the document that I copied the string from
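
To see why a copied phrase does not necessarily put its source chunk on top, it can help to look at how BM25 weighs terms. The sketch below is a plain-Python approximation of a common BM25 variant using the collection's settings (k1=1.25, b=0.7). The tokenizer, the toy chunks, and the function names are assumptions for illustration; Weaviate's real tokenization and scoring may differ in detail.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (rough word tokenization)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def bm25_scores(query, docs, k1=1.25, b=0.7):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long chunks are penalised more as b grows
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy chunks, invented for this sketch
chunks = [
    "EMA was created in 2001 to do something",                             # short, has the phrase
    "EMA was created in 2001 to do something " + "unrelated filler text " * 50,  # phrase in a long chunk
    "It was in 2001 that the report was sent to the board",                # shares only common words
    "Completely different content",
]
query = "EMA was created in 2001 to?"
scores = bm25_scores(query, chunks)
for chunk, score in zip(chunks, scores):
    print(f"{score:.3f}  {chunk[:50]}")
```

In this toy data the long chunk that contains the whole phrase scores below the short chunk that only shares common words such as "was", "in" and "to", because of length normalisation. Scores are computed per stored chunk, not per PDF.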

Using Weaviate near_text search

collection = client.collections.get("LlamaIndex")
question = "EMA was created in 2001 to?" # Took partial string 
response = collection.query.near_text(
  query=question,
  limit=5,

  return_metadata=MetadataQuery(
    distance=True,
    certainty=True,
    score=True,
    explain_score=True
  )
)

for obj in response.objects:
  print(f"METADATA: {obj.metadata}") # Did not retrieve the document that I copied the string from

Solution:

I have put together some code based on yours that may help you.

I am not sure which vectorizer you are using. This example uses OpenAI.

We have some recipes for Ollama here:

recipes/integrations/llm-frameworks/llamaindex at main · weaviate/recipes · GitHub

P.S. I used two PDF files located here, but any PDFs placed under the pdfs folder should also work:

#!pip3 install -U weaviate-client llama_index llama-index-readers-file llama-index-embeddings-openai llama-index-vector-stores-weaviate

# Weaviate
import weaviate
from weaviate.classes.config import Configure, VectorDistances, Property, DataType
from weaviate.util import generate_uuid5
from weaviate.classes.query import MetadataQuery

# LlamaIndex
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core.node_parser import SentenceSplitter

from llama_index.embeddings.openai import OpenAIEmbedding

import os
import openai

# Set the key in the environment, or uncomment the next line and fill it in
#os.environ["OPENAI_API_KEY"] = ""
openai.api_key = os.environ["OPENAI_API_KEY"]

# Use the same embedding model in LlamaIndex as in the Weaviate collection below
embed_model = OpenAIEmbedding(model="text-embedding-3-small", embed_batch_size=10)
Settings.embed_model = embed_model

# Quick test of the LlamaIndex embedding model
embeddings = embed_model.get_text_embedding(
    "Open AI new Embeddings models is great."
)

print(embeddings[:5])


# Creating a Weaviate collection
def create_collection(client, collection_name):
    client.collections.create(
        collection_name,
        generative_config=Configure.Generative.openai(),
        vectorizer_config=Configure.Vectorizer.text2vec_openai(model="text-embedding-3-small"),
        vector_index_config=Configure.VectorIndex.hnsw(distance_metric=VectorDistances.COSINE),
        reranker_config=Configure.Reranker.transformers(),
        inverted_index_config=Configure.inverted_index(
            bm25_b=0.7,
            bm25_k1=1.25,
            index_null_state=True,
            index_property_length=True,
            index_timestamps=True,
        ),
    )
 
# Create index using LlamaIndex
def create_weaviate_index(client, index_name, doc_folder):
    create_collection(client, index_name)
    vector_store = WeaviateVectorStore(weaviate_client=client, index_name=index_name, text_key="content")
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents([], storage_context=storage_context)
    documents = SimpleDirectoryReader(input_dir=doc_folder).load_data()
    node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
    nodes = node_parser.get_nodes_from_documents(documents, show_progress=False)
    index.insert_nodes(nodes)
    return index

client = weaviate.connect_to_local()
index_name = "LlamaIndex"
# WARNING: this deletes the collection if it already exists
client.collections.delete(index_name)
doc_folder = "./pdfs"
create_weaviate_index(client, index_name, doc_folder)

# Fetch one object to confirm that vectors were stored
collection = client.collections.get(index_name)
print(collection.query.fetch_objects(include_vector=True, limit=1).objects[0].vector)

# Query Weaviate directly with BM25
for obj in collection.query.bm25("food").objects:
    print(obj.properties)

vector_store = WeaviateVectorStore(weaviate_client=client, index_name=index_name, text_key="content")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents([], storage_context=storage_context)
query_engine = index.as_query_engine()

#filtering
from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="file_name", operator=FilterOperator.EQ, value="brazil"),
    ]
)

retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is the traditional food of this country?")

# generating an answer
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="file_name", value="netherlands")]
)
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("What is the food of this country?")
print(response)

client.close()  # release the connection when done
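
Since both BM25 and vector scores apply to stored chunks, the chunk_size and chunk_overlap values passed to SentenceSplitter above influence what can be retrieved. The sketch below is a simplified fixed-window illustration of those two parameters, not LlamaIndex's actual algorithm (which also tries to keep sentences whole); sliding_chunks is a hypothetical helper.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not tokens:
        return []
    step = chunk_size - chunk_overlap
    # Stop before a window that would contain only already-covered overlap tokens
    last_start = max(len(tokens) - chunk_overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]


tokens = list(range(10))  # stand-in for ten tokens of text
chunks = sliding_chunks(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each window repeats the last token of the previous one, so a phrase that straddles a boundary is less likely to be cut in a way that hides it from both neighbouring chunks.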