Preface

This article compares Elasticsearch and Vearch as vector search engines, looking at retrieval performance and related metrics.

Elasticsearch

Both ES7 and ES8 currently support a vector field type, but ES8 adds kNN search, which gives better retrieval performance.

kNN search supports several similarity metrics: www.elastic.co/guide/en/el...

kNN search usage reference: www.elastic.co/guide/en/el...

ES7 + Elastiknn plugin

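ES7 on its own only stores vector data. It does nothing to optimize retrieval, so every search is a brute-force scan and performance is very poor. Installing the Elastiknn plugin speeds up retrieval.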
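To make "brute-force scan" concrete, here is a minimal sketch of what an unindexed search has to do: score the query against every stored vector and keep the best k. The `docs` dictionary is a hypothetical in-memory stand-in for an index, not how ES stores data internally.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, docs, k):
    """Score every vector against the query: O(N * dims) work per search.

    `docs` maps a document id to its vector (hypothetical stand-in for an index).
    Returns the k best (doc_id, score) pairs, highest score first.
    """
    scored = ((cosine_similarity(query, vec), doc_id) for doc_id, vec in docs.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


docs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
result = brute_force_top_k([1.0, 0.1], docs, k=2)
print(result)
```

The cost grows linearly with the number of documents, which is consistent with the multi-second latencies measured for plain ES7 later in this article.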

I deployed Elasticsearch locally with Docker. The installation steps are as follows:

bash

# Use colima to run Docker locally
# ES needs a fair amount of memory: in local tests it starts fine with 8 GB but fails to start with 4 GB
colima start --cpu 4 --memory 8

# Enter the colima VM
colima ssh
# Run this inside the colima VM
sudo sysctl -w vm.max_map_count=262144

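# Exit the colima VM and run the following on the host.
# The image tag must match the plugin build: elastiknn-7.17.3.0 targets ES 7.17.3
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.17.3

docker network create elastic

docker run --name es01 --net elastic -p 9200:9200 -it -m 8GB -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.17.3

# In a second terminal, enter the container and install the plugin from the bin directory
# (the zip must already be present at this path inside the container)
docker exec -it es01 bash
cd bin
./elasticsearch-plugin install file:///root/download/elastiknn-7.17.3.0.zip
# Restart the container afterwards (docker restart es01) so that ES loads the plugin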

Creating the mapping:

json

{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "elastiknn": true
        }
    },
    "mappings": {
        "properties": {
            "driver": {
                "properties": {
                    "id": {
                        "type": "keyword"
                    },
                    "age": {
                        "type": "short"
                    }
                }
            },
            "img_vector": {
                "type": "elastiknn_dense_float_vector",
                "elastiknn": {
                    "dims": 512,
                    "model": "lsh",
                    "similarity": "cosine",
                    "L": 99,
                    "k": 1
                }
            }
        }
    }
}
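For reference, a query against this mapping could be built roughly as follows. This is a sketch based on my reading of the Elastiknn query format (`elastiknn_nearest_neighbors`); the `candidates` value is an illustrative assumption rather than a tuned setting, and the article does not show the exact query used in the tests.

```python
import json


def elastiknn_lsh_query(vector, size=5, candidates=100):
    """Build a search body for the `img_vector` field defined in the mapping above.

    `candidates` is how many approximate LSH matches get re-scored with the
    exact similarity; 100 is an assumed value for illustration only.
    """
    return {
        "size": size,
        "query": {
            "elastiknn_nearest_neighbors": {
                "field": "img_vector",
                "vec": {"values": list(vector)},
                "model": "lsh",
                "similarity": "cosine",
                "candidates": candidates,
            }
        },
    }


body = elastiknn_lsh_query([0.1] * 512, size=5)
print(json.dumps(body)[:100])
```

The body would be sent to the index's `_search` endpoint. `model` and `similarity` have to agree with the values in the mapping.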

ES8

ES8 is also set up with Docker; the installation is similar to ES7.

The mapping is as follows:

json

{
    "settings": {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 1
        }
    },
    "mappings": {
        "properties": {
            "driver": {
                "properties": {
                    "id": {
                        "type": "keyword"
                    },
                    "age": {
                        "type": "short"
                    }
                }
            },
            "img_vector": {
                "type": "dense_vector",
                "dims": 512,
                "similarity": "dot_product",
                "index_options": {
                    "type": "hnsw",
                    "m": 40,
                    "ef_construction": 400
                }
            }
        }
    }
}
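A kNN search against this mapping might look like the sketch below. Two details matter: with `dot_product` similarity ES expects unit-length float vectors, so the query vector is L2-normalized first, and `num_candidates` may not be smaller than `k`. The helper names are my own, and the `_source` filter is just an illustrative choice.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length, as dot_product similarity requires."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def knn_search_body(vector, k=100, num_candidates=200):
    """Build the body for a `_search` request that uses the top-level `knn` option."""
    if num_candidates < k:
        raise ValueError("num_candidates must be >= k")
    return {
        "knn": {
            "field": "img_vector",
            "query_vector": l2_normalize(vector),
            "k": k,
            "num_candidates": num_candidates,
        },
        "size": k,
        "_source": ["driver.id"],  # illustrative: skip returning the 512-dim vector
    }


body = knn_search_body([1.0] * 512)
print(body["knn"]["k"], body["knn"]["num_candidates"])
```

Raising `num_candidates` makes each shard examine more HNSW candidates, which tends to improve recall at the cost of latency.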

ES retrieval efficiency test

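A single-node ES cluster was set up locally, using versions 7.17.3 and 8.11.0.
The test compares native ES vector retrieval with retrieval accelerated by the Elastiknn plugin.
The test data is randomly generated locally and follows the mappings shown above; the vectors have 512 dimensions.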
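The generator script is not part of this article, so the following is only a sketch of how such random test data could be produced. It assumes documents shaped like the mappings above, hypothetical id values such as `driver-0`, a hypothetical index name `vector_test`, and unit-length vectors so that the same data also works with `dot_product`.

```python
import json
import math
import random


def random_unit_vector(dims, rng):
    """Draw a Gaussian random vector and scale it to unit length."""
    values = [rng.gauss(0.0, 1.0) for _ in range(dims)]
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


def random_doc(doc_id, rng, dims=512):
    """One document matching the mapping: driver.id, driver.age, img_vector."""
    return {
        "driver": {"id": f"driver-{doc_id}", "age": rng.randint(18, 70)},
        "img_vector": random_unit_vector(dims, rng),
    }


def bulk_body(index_name, docs):
    """Serialize docs as newline-delimited JSON for the _bulk API."""
    lines = []
    for i, doc in enumerate(docs):
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(i)}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


rng = random.Random(42)  # fixed seed so runs are repeatable
docs = [random_doc(i, rng) for i in range(3)]
payload = bulk_body("vector_test", docs)
print(len(payload.splitlines()), "lines in bulk payload")
```

For a million documents the payload would need to be sent in batches rather than as one request.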

Test configuration	Documents	Index size on disk	Query latency (mean of 10 queries)
V7.17.3 + 1 shard + TOP5	1,000,000	2.7GB	8.7s
V7.17.3 + 3 shards + TOP5	1,000,000	2.7GB	2.7s
V7.17.3 + Elastiknn + 1 shard + TOP5	1,000,000	3.3GB	400ms
V7.17.3 + Elastiknn + 1 shard + TOP100	1,000,000	3.3GB	700ms
V7.17.3 + Elastiknn + 1 shard + TOP100	2,000,000	6.8GB	1.3s
V8.11.0 + 1 shard + TOP100	500,000	1.2GB	60ms
V8.11.0 + 1 shard + TOP100 `(M=40, ef_construction=400, num_candidates=100)`	500,000	1.2GB	100ms
V8.11.0 + 1 shard + TOP100 `(M=40, ef_construction=400, num_candidates=200)`	500,000	1.2GB	200ms
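The "mean of 10 queries" figures can be collected with a small timing harness like the sketch below. `run_query` stands for whatever function issues one search request; `fake_query` here is a placeholder that just sleeps, so the printed number says nothing about ES itself.

```python
import statistics
import time


def mean_latency_ms(run_query, repeats=10):
    """Call `run_query` `repeats` times and return (mean, samples) in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), samples


def fake_query():
    """Placeholder for a real search call; sleeps about 5 ms."""
    time.sleep(0.005)


mean_ms, samples = mean_latency_ms(fake_query)
print(f"mean over {len(samples)} runs: {mean_ms:.1f} ms")
```

With only 10 samples the mean is sensitive to cold caches, so it can be worth discarding a warm-up query or also reporting the median.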

ES recall test

Version: V8.11.0

Parameters:

At index time:

M: 40

ef_construction: 400

At search time:

k: 100

num_candidates: 200

Item	Mean query latency	Recall
Top10	18ms	0.994709
Top100	19ms	0.994709
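The article does not say how recall was computed. A common definition, assumed here, is the fraction of the exact top-k neighbors (from an exhaustive search) that also appear in the approximate top-k, averaged over all test queries. The document ids below are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k ids that the approximate top-k also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact result list is empty")
    return len(set(approx_ids[:k]) & truth) / len(truth)


def mean_recall(pairs, k):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per query."""
    return sum(recall_at_k(approx, exact, k) for approx, exact in pairs) / len(pairs)


# Two hypothetical queries: the first misses one true neighbor, the second is perfect.
pairs = [
    (["d1", "d2", "d3", "d9"], ["d1", "d2", "d3", "d4"]),
    (["d5", "d6", "d7", "d8"], ["d5", "d6", "d7", "d8"]),
]
score = mean_recall(pairs, k=4)
print(score)
```

The exact lists would come from a brute-force search over the same data, which is slow but only needs to run once per query set.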

Vearch

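Vearch is a distributed vector search system for storing and computing over massive numbers of feature vectors. It provides the underlying infrastructure for vector retrieval in AI and is widely applicable to machine learning domains such as image, audio/video, and natural language processing.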

Features supported by Vearch:

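High availability, high reliability, and persistent data storage.
RESTful API support.
Multiple ANNS (approximate nearest neighbor search) index types to suit different scenarios.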

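When you create a space in Vearch, you have to configure the index-building parameters, and at search time each index type has its own set of search parameters. These settings affect retrieval performance, recall, and precision. For tuning tips, see: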

github.com/vearch/vear...

Vearch retrieval efficiency test

The test data is again randomly generated locally, with 512-dimensional vectors.

Test configuration	Documents	Query latency (mean of 10 queries)
1 partition + Top100	1,000,000	80ms

Elasticsearch vs. Vearch

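Search engine	Supports full data updates	Retrieval performance	Difficulty of use (parameter complexity)	Stability
Elasticsearch	Yes	Medium	Low	High
Vearch	Yes	High	Medium	Medium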

Choosing a vector search engine

Preface