Vector Database Technology Series, Part 2: Introduction to Milvus

I. Preface

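Milvus is a cloud-native vector database designed for high availability, high performance, and easy scalability. It is used for real-time retrieval over massive vector datasets.

Storage: as a database, Milvus stores vectors. Unstructured data such as text, images, and audio is first converted into vectors with embedding techniques, and Milvus then stores those vectors.

Indexing: index algorithms are the core technology of a vector database. Milvus supports several vector indexes, including IVF, HNSW, and DiskANN, all of which have been heavily customized and optimized.

Search: beyond basic similarity search, Milvus supports filtered, range, scalar, and hybrid search to cover the retrieval needs of different use cases.

Architecture: Milvus uses a shared-storage architecture in which compute and storage are separated and compute nodes scale horizontally.

The Milvus vector database has the following characteristics:

High performance: it can run vector similarity search over massive datasets.
High availability and reliability: Milvus can scale out in the cloud, and its disaster recovery capabilities keep the service highly available.
Hybrid queries: Milvus can filter on scalar fields during vector similarity search, which enables hybrid queries.
Developer friendly: the Milvus ecosystem supports multiple languages and tools.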

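II. Architecture

The Milvus architecture consists of an access layer, coordinator services, worker nodes, and storage.

[Figure: Milvus architecture diagram]

1. Access Layer

The access layer is the front door of the system. It consists of a group of stateless proxies, exposes the endpoint that users connect to, validates client requests, and merges results before returning them.

The proxies themselves are stateless. A unified service address is provided through load-balancing components such as Nginx, Kubernetes Ingress, NodePort, and LVS.
Because Milvus uses a massively parallel processing (MPP) architecture, the proxy aggregates and post-processes intermediate results and then returns the final result to the client.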
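
The aggregation step described above can be pictured as a top-k merge. The following is only a minimal sketch in plain Python, not Milvus's actual proxy code: the function name merge_topk and the per-node hit lists are hypothetical, and each node is assumed to return its hits already sorted by descending similarity.

```python
import heapq
from itertools import islice


def merge_topk(partial_results, k):
    """Merge per-node hit lists (each sorted by descending score) into a global top-k."""
    merged = heapq.merge(*partial_results, key=lambda hit: hit[1], reverse=True)
    return list(islice(merged, k))


# Hypothetical partial results from two query nodes: (entity id, similarity score).
node_a = [(0, 1.00), (4, 0.60), (2, 0.40)]
node_b = [(1, 0.63), (5, 0.47), (3, 0.25)]

top3 = merge_topk([node_a, node_b], k=3)
print(top3)  # [(0, 1.0), (1, 0.63), (4, 0.6)]
```

Because every partial list is already sorted, the merge only needs to look at the head of each list, which is why a proxy can combine results cheaply.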

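2. Coordinator Service

The coordinator service is the brain of the system and assigns tasks to the worker nodes. It has four roles: root coord, data coord, query coord, and index coord.

Root coordinator (root coord): handles data definition language (DDL) and data control language (DCL) requests, such as creating or dropping collections, partitions, and indexes. It also maintains the central timestamp service (TSO) and advances the time window.
Query coordinator (query coord): manages the topology and load balancing of query nodes, and the handoff from growing segments to sealed segments.
Data coordinator (data coord): manages the topology of data nodes, maintains data metadata, and triggers background data operations such as flush and compact.
Index coordinator (index coord): manages the topology of index nodes, builds indexes, and maintains index metadata.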
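
To make the idea of a central timestamp service more concrete, here is a toy timestamp oracle. It is a sketch under stated assumptions rather than Milvus's implementation: the class name TimestampOracle is hypothetical, and the layout (physical milliseconds shifted left by 18 bits, plus a logical counter) is assumed for illustration.

```python
import threading
import time

LOGICAL_BITS = 18  # assumed split between the physical and logical parts


class TimestampOracle:
    """Toy TSO: hands out strictly increasing hybrid timestamps."""

    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self._clock = clock          # returns physical time in milliseconds
        self._lock = threading.Lock()
        self._physical = 0
        self._logical = 0

    def allocate(self):
        with self._lock:
            now = self._clock()
            if now > self._physical:
                # Physical time moved forward: reset the logical counter.
                self._physical, self._logical = now, 0
            else:
                # Same (or earlier) millisecond: bump the logical counter.
                # A real service would also handle counter overflow.
                self._logical += 1
            return (self._physical << LOGICAL_BITS) | self._logical


tso = TimestampOracle(clock=lambda: 1_700_000_000_000)  # frozen clock for the demo
stamps = [tso.allocate() for _ in range(3)]
print(stamps == sorted(set(stamps)))  # True: unique and increasing
```

Even with the clock frozen, the logical counter keeps the timestamps ordered, which is the property a single ordering authority is meant to provide.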

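3. Worker Node

The worker nodes are the limbs of the system. They carry out the instructions issued by the coordinator service and the data manipulation language (DML) commands initiated by the proxy. Because Milvus separates compute from storage, worker nodes are stateless and can be scaled and recovered quickly with Kubernetes.

There are three worker roles: data node, query node, and index node.

Query node: subscribes to the log broker to obtain incremental log data and turns it into growing segments, loads historical data from object storage, and provides hybrid scalar + vector query and search.
Data node: subscribes to the log broker to obtain incremental log data, processes mutation requests, and packs the log data into log snapshots persisted in object storage.
Index node: executes index-building tasks. Index nodes do not need to be memory-resident and can be implemented in a serverless fashion.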
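
The difference between a growing segment and a sealed segment can be sketched as follows. This is an illustrative model only: the classes Segment and SegmentWriter and the row-count threshold are hypothetical, and Milvus decides when to seal a segment using its own rules.

```python
class Segment:
    def __init__(self, segment_id, capacity):
        self.segment_id = segment_id
        self.capacity = capacity
        self.rows = []
        self.sealed = False

    def append(self, row):
        if self.sealed:
            raise ValueError("sealed segments are immutable")
        self.rows.append(row)
        if len(self.rows) >= self.capacity:
            self.sealed = True  # full: no further writes are accepted


class SegmentWriter:
    """Appends rows to the current growing segment and opens a new one when it seals."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.segments = [Segment(0, capacity)]

    def insert(self, row):
        current = self.segments[-1]
        if current.sealed:
            current = Segment(len(self.segments), self.capacity)
            self.segments.append(current)
        current.append(row)


writer = SegmentWriter(capacity=3)  # hypothetical row-count threshold
for i in range(7):
    writer.insert({"id": i})

states = [(s.segment_id, len(s.rows), s.sealed) for s in writer.segments]
print(states)  # [(0, 3, True), (1, 3, True), (2, 1, False)]
```

Sealed segments never change again, so they are natural candidates for index building and for loading from object storage, while only the last segment keeps growing.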

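4. Storage

Storage is the skeleton of the system and is responsible for persisting Milvus data. It has three parts: the metadata store (meta store), the message store (log broker), and object storage.

Meta store: stores snapshots of metadata, such as collection schemas, node status, and message consumption checkpoints. Metadata storage requires very high availability, strong consistency, and transaction support, so etcd is the natural choice here. etcd also handles service registration and health checks.
Log broker: a publish-subscribe system that supports replay. It persists streaming writes and provides reliable asynchronous query execution, event notification, and result return. When a worker node recovers from a crash, it replays the log broker to guarantee the integrity of incremental data.
Object storage: stores log snapshot files, scalar and vector index files, and intermediate query results. Milvus uses MinIO as its object storage and can also be deployed on AWS S3 and Azure Blob, two of the most widely used low-cost storage services. However, because object storage has high access latency and is billed per request, Milvus plans to support a memory- or SSD-based cache pool in the future, using hot/cold separation to improve performance and reduce cost.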
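
The recovery path described for the log broker (replay from a checkpoint) can be sketched with a toy append-only log. Everything here is hypothetical and simplified: LogBroker and Worker are illustrative classes, and the snapshot is kept in local variables instead of a real meta store or object storage.

```python
class LogBroker:
    """Toy append-only log with offset-based replay."""

    def __init__(self):
        self._entries = []

    def publish(self, message):
        self._entries.append(message)
        return len(self._entries) - 1  # offset of the new entry

    def replay(self, from_offset):
        return list(enumerate(self._entries))[from_offset:]


class Worker:
    def __init__(self, broker, checkpoint=0, state=None):
        self.broker = broker
        self.checkpoint = checkpoint  # next offset to consume
        self.state = dict(state or {})

    def catch_up(self):
        for offset, (op, key, value) in self.broker.replay(self.checkpoint):
            if op == "insert":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.checkpoint = offset + 1


broker = LogBroker()
broker.publish(("insert", 1, "red"))
broker.publish(("insert", 2, "pink"))

worker = Worker(broker)
worker.catch_up()
# Pretend this snapshot and checkpoint were persisted before a crash.
saved_state, saved_checkpoint = dict(worker.state), worker.checkpoint

broker.publish(("delete", 1, None))    # arrives while the worker is down
broker.publish(("insert", 3, "grey"))

recovered = Worker(broker, checkpoint=saved_checkpoint, state=saved_state)
recovered.catch_up()
print(recovered.state)  # {2: 'pink', 3: 'grey'}
```

The replacement worker starts from the saved snapshot and only replays the entries after the checkpoint, so no incremental data is lost and none is applied twice.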

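III. Basic Database Operations

1. Basic concepts

(1) Data organization

Like a traditional database, Milvus organizes the vector data it stores in a structured way. The main concepts are Collection, Entity, and Field, which map onto relational database concepts as follows.

| Milvus vector database | Relational database |
|-----------------|------------|
| Collection | Table |
| Entity | Row |
| Field | Column |

[Figure: a Collection with 8 columns and 6 entities]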

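(2) Distribution

Because Milvus is distributed, it partitions and shards data to improve query and write performance.

Partition

Partitioning divides data into multiple parts according to some rule; each part is called a partition. The goal is to improve query efficiency and avoid the performance problems caused by a single oversized table. In Milvus, data can be partitioned by time range, geographic location, or other business logic. For example, if data is partitioned by date range, a query for a specific date only needs to search the matching partition, which narrows the search scope and speeds up the query.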
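
The effect of partition pruning can be shown with a small in-memory model. This is a sketch, not the Milvus API: PartitionedCollection is a hypothetical class, and the date-named partitions are made-up examples.

```python
from collections import defaultdict


class PartitionedCollection:
    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_name, entity):
        self.partitions[partition_name].append(entity)

    def scan(self, partition_names=None):
        """Return (entities, number of entities examined)."""
        names = partition_names if partition_names is not None else list(self.partitions)
        entities = [e for name in names for e in self.partitions.get(name, [])]
        return entities, len(entities)


coll = PartitionedCollection()
days = {"2024-01-01": [0, 1, 2], "2024-01-02": [3, 4], "2024-01-03": [5]}
for day, ids in days.items():
    for i in ids:
        coll.insert(day, {"id": i})

_, scanned_all = coll.scan()
hits, scanned_one = coll.scan(partition_names=["2024-01-02"])
print(scanned_all, scanned_one, [e["id"] for e in hits])  # 6 2 [3, 4]
```

Restricting the scan to one partition examines 2 entities instead of 6; on real data the same idea reduces the number of segments a search has to touch.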

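Shard

Sharding spreads write operations across different nodes so that Milvus can make full use of the cluster's parallel computing capacity for writes. By default, a single Collection contains 2 shards. Milvus currently shards by hashing the primary key, and plans to support more flexible schemes such as random and custom sharding in the future.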
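
Primary-key hash sharding can be illustrated in a few lines. This is a sketch under assumptions: the helper shard_for is hypothetical, and SHA-256 is used only to get a stable hash for the demo; the text does not say which hash function Milvus uses.

```python
import hashlib


def shard_for(primary_key, num_shards=2):
    """Map a primary key to a shard with a stable hash (illustrative choice of hash)."""
    digest = hashlib.sha256(str(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


assignment = {}
for pk in range(10):
    assignment.setdefault(shard_for(pk), []).append(pk)

print(assignment)  # each key always lands on the same shard
```

The important property is determinism: the same primary key is always routed to the same shard, so writes for different keys can proceed in parallel on different nodes.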

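(3) Indexes

Milvus supports several kinds of index, covering both vector indexes and scalar indexes.

Vector indexes: Milvus supports various types of vector data, including floating-point embeddings (often called float vectors or dense vectors), binary embeddings (binary vectors), and sparse embeddings (sparse vectors).

Scalar indexes: scalar indexes are used to filter on non-vector fields such as integers and strings, much like indexes in a traditional database.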
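
One common way to build a scalar index is to keep the values sorted and answer range filters with binary search. The sketch below shows that general technique in plain Python; SortedScalarIndex is a hypothetical class, the values are made up, and it is not a description of Milvus internals.

```python
from bisect import bisect_left, bisect_right


class SortedScalarIndex:
    """Sorted (value, id) pairs that answer range filters by binary search."""

    def __init__(self, values_by_id):
        pairs = sorted((value, entity_id) for entity_id, value in values_by_id.items())
        self._values = [value for value, _ in pairs]
        self._ids = [entity_id for _, entity_id in pairs]

    def range(self, low, high):
        """IDs of entities with low <= value <= high."""
        start = bisect_left(self._values, low)
        end = bisect_right(self._values, high)
        return sorted(self._ids[start:end])


prices = {0: 30, 1: 12, 2: 45, 3: 12, 4: 27}  # hypothetical entity id -> scalar value
index = SortedScalarIndex(prices)
print(index.range(12, 27))  # [1, 3, 4]
```

The range lookup costs two binary searches instead of a full scan, which is the kind of saving a scalar index is meant to provide when filtering.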

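2. Creating a collection

You can install a Milvus environment locally, or apply for a free cluster on Zilliz Cloud; this article uses the latter.

Once the cluster has been provisioned, you can connect to the cloud service from Python and create the collection.

(1) Install pymilvus locally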

pip3 install pymilvus

(2) Create a client for the remote connection

from pymilvus import MilvusClient, DataType

client = MilvusClient(
    uri="xxxxx",
    token="xxxxxx"
)

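Here uri and token are the connection details of the Zilliz Cloud cluster, which can be found in the Zilliz Cloud web console.

(3) Create the schema and fields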

# 3.1. Create collection schema
schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)

# 3.2. Add fields to schema
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=5)
schema.add_field(field_name="color", datatype=DataType.VARCHAR, max_length=512)

(4) Create the indexes

# 3.3. Prepare index parameters
index_params = client.prepare_index_params()

# 3.4. Add indexes
index_params.add_index(
    field_name="id",
    index_type="STL_SORT"
)

index_params.add_index(
    field_name="vector", 
    index_type="AUTOINDEX",
    metric_type="COSINE"
)

(5) Create the collection

# 3.5. Create a collection with the index loaded simultaneously
client.create_collection(
    collection_name="my_demo",
    schema=schema,
    index_params=index_params
)

Once creation completes, the collection and its schema can be found under the cluster's collections.

3. Inserting data

Next, we insert some data.

#4.1 build data
data=[
    {"id": 0, "vector": [0.3580376395471989, -0.6023495712049978, 0.18414012509913835, -0.26286205330961354, 0.9029438446296592], "color": "pink_8682"},
    {"id": 1, "vector": [0.19886812562848388, 0.06023560599112088, 0.6976963061752597, 0.2614474506242501, 0.838729485096104], "color": "red_7025"},
    {"id": 2, "vector": [0.43742130801983836, -0.5597502546264526, 0.6457887650909682, 0.7894058910881185, 0.20785793220625592], "color": "orange_6781"},
    {"id": 3, "vector": [0.3172005263489739, 0.9719044792798428, -0.36981146090600725, -0.4860894583077995, 0.95791889146345], "color": "pink_9298"},
    {"id": 4, "vector": [0.4452349528804562, -0.8757026943054742, 0.8220779437047674, 0.46406290649483184, 0.30337481143159106], "color": "red_4794"},
    {"id": 5, "vector": [0.985825131989184, -0.8144651566660419, 0.6299267002202009, 0.1206906911183383, -0.1446277761879955], "color": "yellow_4222"},
    {"id": 6, "vector": [0.8371977790571115, -0.015764369584852833, -0.31062937026679327, -0.562666951622192, -0.8984947637863987], "color": "red_9392"},
    {"id": 7, "vector": [-0.33445148015177995, -0.2567135004164067, 0.8987539745369246, 0.9402995886420709, 0.5378064918413052], "color": "grey_8510"},
    {"id": 8, "vector": [0.39524717779832685, 0.4000257286739164, -0.5890507376891594, -0.8650502298996872, -0.6140360785406336], "color": "white_9381"},
    {"id": 9, "vector": [0.5718280481994695, 0.24070317428066512, -0.3737913482606834, -0.06726932177492717, -0.6980531615588608], "color": "purple_4976"}
]

#4.2 insert data into my_demo
res = client.insert(
    collection_name="my_demo",
    data=data
)

print(res)
#{'insert_count': 10, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 'cost': 1}

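4. Searching data

Next, we search the data by vector. We ask Milvus to use cosine similarity (COSINE) to compare the query vector with the vectors in the Collection and to return the three most similar vectors.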

# 5. Single vector search
query_vector = [0.3580376395471989, -0.6023495712049978, 0.18414012509913835, -0.26286205330961354, 0.9029438446296592]
res = client.search(
    collection_name="my_demo",
    anns_field="vector",
    data=[query_vector],
    limit=3,
    search_params={"metric_type": "COSINE"}
)

for hits in res:
    for hit in hits:
        print(hit)

# Output of the print(hit) calls above:
#{'id': 0, 'distance': 1.0, 'entity': {}}
#{'id': 1, 'distance': 0.6290165185928345, 'entity': {}}
#{'id': 4, 'distance': 0.5975797176361084, 'entity': {}}

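query_vector is the query vector. It is identical to the first inserted entity, so that entity ranks first in the results with a distance of 1. With the COSINE metric the reported distance is the cosine similarity, so larger values mean closer matches.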
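
To see where these numbers come from, the ranking can be reproduced with a brute-force cosine computation. This is a sketch for illustration, not how Milvus searches (it uses an index): only five of the ten inserted vectors are included, and their components are rounded to four decimals.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# A subset of the inserted vectors, rounded to four decimals.
vectors = {
    0: [0.3580, -0.6023, 0.1841, -0.2629, 0.9029],
    1: [0.1989, 0.0602, 0.6977, 0.2614, 0.8387],
    4: [0.4452, -0.8757, 0.8221, 0.4641, 0.3034],
    5: [0.9858, -0.8145, 0.6299, 0.1207, -0.1446],
    6: [0.8372, -0.0158, -0.3106, -0.5627, -0.8985],
}
query = vectors[0]

ranked = sorted(vectors, key=lambda i: cosine(query, vectors[i]), reverse=True)
for i in ranked[:3]:
    print(i, round(cosine(query, vectors[i]), 2))
# 0 1.0
# 1 0.63
# 4 0.6
```

The scores agree with the distance values returned by the search, up to the rounding of the input vectors.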

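Besides pure vector search, Milvus also supports search with scalar filtering. In this example color is a string scalar field, and the search below is restricted to entities whose color starts with "red".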

# 6. Single vector search with a scalar filter
query_vector = [0.3580376395471989, -0.6023495712049978, 0.18414012509913835, -0.26286205330961354, 0.9029438446296592]
res = client.search(
    collection_name="my_demo",
    anns_field="vector",
    data=[query_vector],
    filter='color like "red%"',  # add a scalar filter
    limit=3,
    search_params={"metric_type": "COSINE"}
)

for hits in res:
    for hit in hits:
        print(hit)
#{'id': 1, 'distance': 0.6290165185928345, 'entity': {}}
#{'id': 4, 'distance': 0.5975797176361084, 'entity': {}}
#{'id': 6, 'distance': -0.24996185302734375, 'entity': {}}

Although id=0 is the most similar entity, it is filtered out because its color does not start with "red".
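
The combination of a scalar filter and a vector ranking can be sketched as "filter first, then rank". This is a simplified model in plain Python: filtered_search is a hypothetical helper, only five of the inserted entities are used (vectors rounded to four decimals), and Milvus may apply the filter differently inside its index.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# A subset of the inserted entities, vectors rounded to four decimals.
entities = [
    {"id": 0, "color": "pink_8682", "vector": [0.3580, -0.6023, 0.1841, -0.2629, 0.9029]},
    {"id": 1, "color": "red_7025", "vector": [0.1989, 0.0602, 0.6977, 0.2614, 0.8387]},
    {"id": 4, "color": "red_4794", "vector": [0.4452, -0.8757, 0.8221, 0.4641, 0.3034]},
    {"id": 5, "color": "yellow_4222", "vector": [0.9858, -0.8145, 0.6299, 0.1207, -0.1446]},
    {"id": 6, "color": "red_9392", "vector": [0.8372, -0.0158, -0.3106, -0.5627, -0.8985]},
]


def filtered_search(query, entities, prefix, limit):
    """Keep entities whose color starts with prefix, then rank them by cosine similarity."""
    candidates = [e for e in entities if e["color"].startswith(prefix)]
    candidates.sort(key=lambda e: cosine(query, e["vector"]), reverse=True)
    return [e["id"] for e in candidates[:limit]]


result = filtered_search(entities[0]["vector"], entities, "red", 3)
print(result)  # [1, 4, 6]
```

The prefix test plays the role of the like "red%" expression: id 0 never reaches the ranking step, so the third slot goes to id 6 even though its similarity is negative.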

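IV. Embedding and Reranking

The main responsibility of a vector database is storing and retrieving vectors. Converting data into vectors (embedding) and reordering retrieved results by semantic relevance (reranking) generally fall outside that scope, but Milvus integrates some toolkits and solutions for users to choose from.

1. Embedding

Embedding is the process of converting data into vectors. Milvus integrates packages for several embedding models, including OpenAI and bge-m3. The example below uses Milvus's default embedding model to perform the conversion.

First, install the model subpackage: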

pip install "pymilvus[model]"

Next, use DefaultEmbeddingFunction, which relies on the all-MiniLM-L6-v2 sentence-transformer model, to generate the embeddings:

from pymilvus import model

# This will download "all-MiniLM-L6-v2", a light weight model.
ef = model.DefaultEmbeddingFunction()

# Data from which embeddings are to be generated 
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]

embeddings = ef.encode_documents(docs)

# Print embeddings
print("Embeddings:", embeddings)
# Print dimension and shape of embeddings
print("Dim:", ef.dim, embeddings[0].shape)

The expected output looks like the following (abridged):

Embeddings: [array([-3.09392996e-02, -1.80662833e-02,  1.34775648e-02,  2.77156215e-02,
       -4.86349640e-03, -3.12581174e-02, -3.55921760e-02,  5.76934684e-03,
        2.80773244e-03,  1.35783911e-01,  3.59678417e-02,  6.17732145e-02,
...
       -4.61330153e-02, -4.85207550e-02,  3.13997865e-02,  7.82178566e-02,
       -4.75336798e-02,  5.21207601e-02,  9.04406682e-02, -5.36676683e-02],
      dtype=float32)]
Dim: 384 (384,)

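2. Reranking

A vector database typically uses approximate nearest neighbor (ANN) search to retrieve a large set of potentially relevant results, but the nearest vectors are not necessarily the closest in meaning. Reranking uses a model to reorder the retrieved results by semantic relevance so that the best matches come first.

The following demonstrates this with a BGE model.

First, install the model subpackage: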

pip install "pymilvus[model]"

Create a reranker instance and use it to rank the documents:

from pymilvus.model.reranker import BGERerankFunction


bge_rf = BGERerankFunction(
    model_name="BAAI/bge-reranker-v2-m3",  # Specify the model name. Defaults to `BAAI/bge-reranker-v2-m3`.
    device="cpu" # Specify the device to use, e.g., 'cpu' or 'cuda:0'
)

query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"

documents = [
    "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
    "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
    "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
    "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems."
]

results = bge_rf(query, documents)
print(results)

The expected result is as follows:

[RerankResult(text="The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.", score=0.9911615761470803, index=1),
 RerankResult(text="In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.", score=0.0326971950177779, index=0),
 RerankResult(text='The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.', score=0.006514905766152258, index=3),
 RerankResult(text='In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.', score=0.0042116724917325935, index=2)]
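
The mechanics of reranking (score each candidate against the query, then sort) can be shown without downloading a model. In the sketch below, toy_relevance is a hypothetical stand-in for the BGE cross-encoder that simply counts shared words; it illustrates the control flow only and says nothing about how BGE scores text. The query and candidates are made-up lowercase strings.

```python
def toy_relevance(query, document):
    """Hypothetical stand-in for a reranking model: fraction of query words in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


def rerank(query, documents, top_k):
    """Score every candidate, sort by descending score, and keep the top_k (index, score) pairs."""
    scored = [(toy_relevance(query, doc), index) for index, doc in enumerate(documents)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(index, round(score, 2)) for score, index in scored[:top_k]]


query = "dartmouth conference 1956"
candidates = [
    "turing published a paper in 1950",
    "the dartmouth conference in 1956 coined the term",
    "the logic theorist was written in 1955",
]
reranked = rerank(query, candidates, top_k=2)
print(reranked)  # [(1, 1.0), (0, 0.0)]
```

As in the RerankResult list above, each result keeps the index of the original document, so the reordered hits can be mapped back to the entities returned by the vector search.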

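V. Summary

This article introduced the architecture, basic operations, and extended features of the Milvus vector database.

Architecture: Milvus uses a distributed architecture that separates compute from storage. The access layer is the front door of the system and consists of a group of stateless proxies. The coordinator service is the brain of the system and assigns tasks to the worker nodes. The worker nodes are the limbs of the system and carry out the instructions issued by the coordinator service and the data manipulation language (DML) commands initiated by the proxy. The storage service is the skeleton of the system and is responsible for persisting Milvus data.

Basic operations: the article explained the terms Collection, Entity, and Field, and demonstrated creating a collection, inserting data, and searching data.

Extended features: the article introduced and demonstrated embedding and reranking.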
