SentenceTransformers × Milvus: How to Run Vector Similarity Search

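Have you ever heard something brilliant at a Meetup, only to find later that you can't recall the details? As a Developer Advocate who regularly organizes and attends Meetups, I run into this all the time.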

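To solve this problem, I started exploring similarity search as a way to sift through large amounts of unstructured data. Unstructured data is estimated to make up about 80% of the world's data, and it can be converted into vectors by various ML models. The tool I chose for this article is Milvus, a popular open-source vector database that is good at managing and searching complex data. Milvus helps us uncover hidden relationships in the data and compare how similar items are.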

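This article uses SentenceTransformers to convert unstructured data into embedding vectors. SentenceTransformers is a Python framework that turns sentences, text, and images into embeddings, and it can encode sentences or text in more than 100 languages. We can then compare these embeddings with a similarity metric (for example, cosine distance) to find sentences with similar meanings.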
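
To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Python. The three toy vectors are invented for illustration; real embeddings from SentenceTransformers have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (hypothetical values, not model output).
talk_a = [1.0, 2.0, 0.0]
talk_b = [2.0, 4.0, 0.0]  # same direction as talk_a
talk_c = [0.0, 0.0, 3.0]  # orthogonal to talk_a

print(cosine_similarity(talk_a, talk_b))  # close to 1: same direction
print(cosine_similarity(talk_a, talk_c))  # 0: nothing in common
```

Cosine distance is simply 1 minus this value, so a smaller distance means a closer match.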

01.

Download the Data

Meetup.com does not offer a free public API; you need a Pro subscription to use its API.

For this article I generated some Meetup data myself. You can get it on GitHub (https://github.com/stephen37/similarity_search_mlops/blob/abc1d91878320911f069fcb8a2949b0d7d592370/data/data_meetup.csv) and load it with Pandas.


import pandas as pd
df = pd.read_csv('data/data_meetup.csv')

02.

Tech Stack: Milvus and SentenceTransformers

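We will use Milvus as the vector database, SentenceTransformers to generate text embeddings, and OpenAI GPT-3.5-turbo to summarize the Meetup content. Meetup descriptions tend to be long, so summarizing them keeps the output manageable.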

2.1 Milvus Lite

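Milvus offers several deployment options for different needs. For lightweight applications that need a quick setup, Milvus Lite is the ideal choice. It can be installed with pip install pymilvus and runs directly inside a Jupyter notebook.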

2.2 Milvus with Docker/Docker Compose

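For applications that need more stability, you can run Milvus as a multi-container deployment with Docker Compose. The Docker Compose file is available from the documentation (https://milvus.io/docs/install_standalone-docker-compose.md) and the GitHub page (https://github.com/milvus-io/milvus). When you start Milvus with Docker Compose, you will see three containers, and you connect to Milvus on the default port 19530.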
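
Before running the rest of the walkthrough, it can be handy to confirm that something is actually listening on the Milvus port. The helper below is a sketch that uses only the Python standard library. `is_port_open` is a hypothetical name, and `localhost` is an assumption about where the containers run. It only checks that the TCP port accepts connections, not that Milvus itself is healthy.

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 19530 is the default Milvus port mentioned above; localhost is assumed.
    if is_port_open("localhost", 19530):
        print("Port 19530 is reachable")
    else:
        print("Nothing is listening on port 19530 yet")
```

If the check fails right after starting the containers, waiting a few seconds and retrying is usually enough, since the services take a moment to come up.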

2.3 SentenceTransformers

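SentenceTransformers is used to create the embedding vectors. Install it from PyPI with pip install sentence-transformers. We will use the all-MiniLM-L6-v2 model: compared with the best model SentenceTransformers offers, it is 5 times smaller and 5 times faster while still delivering comparable quality.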

03.

Run a Similarity Search

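3.1 Start Milvus

To run a similarity search we need a vector database. Docker is a quick way to start a Milvus instance (https://milvus.io/docs/install_standalone-docker.md).

3.2 Import the Data into Milvus

Before importing data, we need to create a Collection and define its schema. We start by setting the parameters: the field schemas, the Collection schema, and the Collection name.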


from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# Connect to the running Milvus instance (default port 19530; localhost is assumed)
connections.connect(host="localhost", port="19530")

# Each record is stored as title, date, content, and the content embedding
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=500),
    FieldSchema(name="date", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=10000),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384)
]
schema = CollectionSchema(fields=fields)
collection = Collection(name="mlops_meetups", schema=schema)
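
The `max_length` values in the schema are hard limits, so a single overly long description can make an insert fail. The sketch below checks rows against those limits before inserting. It assumes rows are plain dicts keyed by field name and that Milvus measures VARCHAR length in UTF-8 bytes (worth confirming for your Milvus version). `find_oversized_fields` is a hypothetical helper, not part of pymilvus.

```python
# Limits copied from the schema above.
VARCHAR_LIMITS = {"title": 500, "date": 100, "content": 10000}

def find_oversized_fields(rows, limits=VARCHAR_LIMITS):
    """Return (row_index, field, byte_length) for every value over its limit."""
    problems = []
    for i, row in enumerate(rows):
        for field, limit in limits.items():
            size = len(str(row.get(field, "")).encode("utf-8"))
            if size > limit:
                problems.append((i, field, size))
    return problems

# Hypothetical rows for illustration only.
sample_rows = [
    {"title": "Berlin Meetup", "date": "2023-06-30", "content": "Short text"},
    {"title": "x" * 501, "date": "2023-10-06", "content": "Also short"},
]
print(find_oversized_fields(sample_rows))  # [(1, 'title', 501)]
```

With a Pandas DataFrame, the rows could be produced with `df.to_dict("records")` before calling the helper.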

With the Collection and schema in place, we create an index on the embedding field and then call load() to load the Collection into memory.


collection.create_index(field_name="embedding")
collection.load()

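3.3 Generate Embeddings with SentenceTransformer

As mentioned earlier, we will use SentenceTransformer with the all-MiniLM-L6-v2 model to generate the embedding vectors. We import the library, encode the content column, and insert the result into the Collection.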


from sentence_transformers import SentenceTransformer

transformer = SentenceTransformer('all-MiniLM-L6-v2')

# Encode every Meetup description in one batched call (one 384-dim vector per row)
content_detail = df['content'].tolist()
embeddings = transformer.encode(content_detail)

# Create an embedding column in our DataFrame, one vector per row
df['embedding'] = list(embeddings)

# Insert the data into the collection
collection.insert(data=df)
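
The `embedding` field was declared with `dim=384`, which matches the output size of all-MiniLM-L6-v2, and vectors of any other length will not fit the schema. A small guard like the following, run before `collection.insert`, surfaces a mismatch early, for instance after swapping in a different model. `check_embedding_dims` is a hypothetical helper, and toy vectors stand in for real embeddings.

```python
EXPECTED_DIM = 384  # dim declared on the "embedding" field in the schema

def check_embedding_dims(embeddings, expected_dim=EXPECTED_DIM):
    """Return the indices of embeddings whose length differs from expected_dim."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dim]

# Toy data: two correctly sized vectors and one that is too short.
toy_embeddings = [[0.0] * 384, [0.1] * 384, [0.2] * 128]
bad = check_embedding_dims(toy_embeddings)
print(bad)  # [2]
```

An empty list means every vector has the expected size and the insert can go ahead.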

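3.4 Summarize the Meetup Content

Meetup descriptions are rich in detail: they also include the agenda, event sponsors, and rules specific to the venue or event. That information matters if you plan to attend, but it is irrelevant to our use case. We will use OpenAI GPT-3.5-turbo to summarize the Meetup content.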


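import openai  # openai>=1.0; the API key is read from the OPENAI_API_KEY environment variable

def summarise_meetup_content(content: str) -> str:
    """Return a short summary of a Meetup description using GPT-3.5-turbo."""
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "Summarize content you are provided with."
            },
            {
                "role": "user",
                "content": content
            }
        ],
        temperature=0,  # keep the output as stable as possible
        max_tokens=1024,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    summary = response.choices[0].message.content
    return summary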
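
Because each summary costs an API call, and the same meetup can surface in many searches, it may be worth caching summaries by content. The sketch below uses `functools.lru_cache` from the standard library. `fake_summarise` is a stand-in that only truncates text so the example runs offline; in practice `summarise_meetup_content` could be wrapped the same way.

```python
from functools import lru_cache

call_count = 0  # counts how often the underlying summariser actually runs

def fake_summarise(content: str) -> str:
    """Offline stand-in for an LLM call: keeps only the first 40 characters."""
    global call_count
    call_count += 1
    return content[:40]

@lru_cache(maxsize=256)
def cached_summary(content: str) -> str:
    """Summarise content, reusing the stored result for repeated inputs."""
    return fake_summarise(content)

text = "The MLOps.community meetup in Berlin will feature a main talk."
first = cached_summary(text)
second = cached_summary(text)  # served from the cache, no second call
print(call_count)  # 1
```

The cache lives in memory only, so it is rebuilt on every run; persisting summaries alongside the data would be a separate design choice.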

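3.5 Return Similar Content

Before running the similarity search, the query has to be in a form the vector database understands, so we create an embedding vector for the query as well.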


search_terms = "The speaker speaks about Open Source and ML Platform"
search_data = [transformer.encode(search_terms)] # Must be a list.

3.5.1 Search the Milvus Collection for Similar Content


res = collection.search(
    data=search_data,  # Embedded search value
    anns_field="embedding",  # Search across embeddings
    param={"metric_type": "IP"},  # Rank by inner product
    limit=3,  # Limit to top_k results per search
    output_fields=["title", "content"]  # Include the title and content fields in the results
)


# One list of hits is returned per query vector; we sent a single query
for hits in res:
    print("Search terms:", search_terms)
    print("Results:")
    for hit in hits:
        content = hit.entity.get("content")
        print(hit.entity.get("title"), "----", hit.distance)
        print(f"{summarise_meetup_content(content)} \n")
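
Conceptually, the search ranks stored vectors by inner product with the query and returns the top `limit` hits. The brute-force sketch below illustrates that ranking in plain Python. It is not how Milvus implements search internally, and the titles and 2-dimensional vectors are invented for the example. Because the vectors are normalised to unit length first, the inner product here equals cosine similarity.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_inner_product(query, items, k=3):
    """Return the k (title, score) pairs with the highest inner product."""
    scored = [
        (title, sum(q * v for q, v in zip(query, vec)))
        for title, vec in items
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical titles and toy vectors.
items = [
    ("Open source ML", normalize([1.0, 0.1])),
    ("Monitoring", normalize([0.5, 0.5])),
    ("Cooking class", normalize([0.0, 1.0])),
]
query = normalize([1.0, 0.0])

for title, score in top_k_inner_product(query, items, k=2):
    print(title, "----", round(score, 3))
```

With the inner product metric, a higher score means a closer match, which is why the results are listed from the highest score down.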

3.5.2 Results


Search terms: The speaker speaks about Open Source and ML Platform
Results:
First MLOps.community Berlin Meetup ---- 0.5537542700767517
The MLOps.community meetup in Berlin on June 30th will feature a main talk by Stephen Batifol from Wolt on Scaling Open-Source Machine Learning. The event will also include lightning talks, networking, and food and drinks. The agenda includes opening doors at 6:00 pm, Stephen's talk at 7:00 pm, lightning talks at 7:50 pm, and socializing at 8:15 pm. Attendees can sign up for lightning talks on Meetup.com. The event is in collaboration with neptune.ai. 
MLOps.community Berlin 04: Pre-event Women+ In Data and AI Festival ---- 0.4623506963253021
The MLOps.community Berlin is hosting a special edition event on June 29th and 30th at Thoughtworks. The event is a warm-up for the Women+ In Data and AI festival. The meetup will feature speakers Fiona Coath discussing surveillance capitalism and Magdalena Stenius talking about the carbon footprint of machine learning. The agenda includes talks, lightning talks, and networking opportunities. Attendees are encouraged to review and abide by the event's Code of Conduct for an inclusive and respectful environment. 
MLOps.community Berlin Meetup 02 ---- 0.41342616081237793
The MLOps.community meetup in Berlin on October 6th will feature a main talk by Lina Weichbrodt on ML Monitoring, lightning talks, and networking opportunities. The event will be held at Wolt's office with a capacity limit of 150 people. Lina has extensive experience in developing scalable machine learning models and has worked at companies like Zalando and DKB. The agenda includes food, a bonding activity, the main talk, lightning talks, and socializing. Attendees can also sign up to give lightning talks on various MLOps-related topics. The event is in collaboration with neptune.ai.

See the GitHub repository for the full code (https://github.com/stephen37/similarity_search_mlops/).

About the Author

Stephen Batifol

Developer Advocate at Zilliz
