Hnswlib 介绍与入门使用

Hnswlib是一个强大的近邻搜索(ANN)库, 官方介绍 Header-only C++ HNSW implementation with python bindings, insertions and updates. 热门的向量数据库Milvus底层的ANN库之一就是Hnswlib, 为milvus提供HNSW检索。

HNSW 原理

HNSW 原理

将节点划分成不同层级,贪婪地遍历来自上层的元素,直到达到局部最小值,然后切换到下一层,以上一层中的局部最小值作为新元素重新开始遍历,直到遍历完最低一层。

安装使用

从源码安装:

bash 复制代码
apt-get install -y python-setuptools python-pip
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .

或者直接pip安装 pip install hnswlib

python 使用

python 复制代码
import hnswlib
import numpy as np

dim = 16
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# We split the data in two batches:
data1 = data[:num_elements // 2]
data2 = data[num_elements // 2:]

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

依次介绍:

distances

支持三种距离算法, l2, ip内积,以及cos。

Distance parameter Equation
Squared L2 'l2' d = sum((Ai-Bi)^2)
Inner product 'ip' d = 1.0 - sum(Ai*Bi)
Cosine similarity 'cosine' d = 1.0 - sum(AiBi) / sqrt(sum(AiAi) * sum(Bi*Bi))

API

定义 index

python 复制代码
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

space 指定Distance算法,dim是向量的维度。

初始化索引

python 复制代码
p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)
  • max_elements - 最大容量 (capacity),如果插入数据超过容量会报异常,可以动态扩容
  • ef_construction - 平衡索引构建速度和搜索准确率,ef_construction越大,准确率越高但是构建速度越慢。 ef_construction 提高并不能无限增加索引的质量,常见的 ef_constructio n 参数为 128。
  • M - 表示在建表期间每个向量的边数目量,M会影响内存消耗,M越高,内存占用越大,准确率越高,同时构建速度越慢。通常建议设置在 8-32 之间。

添加数据与查询数据

python 复制代码
# Controlling the recall by setting ef:
# higher ef leads to better accuracy, but slower search
p.set_ef(10)

# Set number of threads used during batch search/construction
# By default using all available cores
p.set_num_threads(4)

print("Adding first batch of %d elements" % (len(data1)))
p.add_items(data1)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data1, k=1)
print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")
  • p.set_ef(10):设置搜索时的最大近邻数量(ef),即在构建索引时最多保留多少个近邻。较高的ef值会导致更好的准确率,但搜索速度会变慢。
  • p.set_num_threads(4):设置在批量搜索和构建索引过程中使用的线程数。默认情况下,使用所有可用的核心。
  • p.add_items(data1):将数据添加到索引中。
  • labels, distances = p.knn_query(data1, k=1):对数据中的每个元素进行查询,找到与其最近的邻居,返回邻居的标签和距离。

保持与加载索引

python 复制代码
# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

# Increase the total capacity (max_elements), so that it will handle the new data
p.load_index("first_half.bin", max_elements = num_elements)

print("Adding the second batch of %d elements" % (len(data2)))
p.add_items(data2)

# Query the elements for themselves and measure recall:
labels, distances = p.knn_query(data, k=1)
print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")
  • 通过save_index保存索引
  • 然后load_index重新加载索引,只要未超过max_elements,可以再次add_items

C++使用

官方提供了C++ 例子,创建索引、插入元素、搜索和序列化

cpp 复制代码
#include "../../hnswlib/hnswlib.h"


int main() {
    int dim = 16;               // Dimension of the elements
    int max_elements = 10000;   // Maximum number of elements, should be known beforehand
    int M = 16;                 // Tightly connected with internal dimensionality of the data
                                // strongly affects the memory consumption
    int ef_construction = 200;  // Controls index search speed/build speed tradeoff

    // Initing index
    hnswlib::L2Space space(dim);
    hnswlib::HierarchicalNSW<float>* alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, max_elements, M, ef_construction);

    // Generate random data
    std::mt19937 rng;
    rng.seed(47);
    std::uniform_real_distribution<> distrib_real;
    float* data = new float[dim * max_elements];
    for (int i = 0; i < dim * max_elements; i++) {
        data[i] = distrib_real(rng);
    }

    // Add data to index
    for (int i = 0; i < max_elements; i++) {
        alg_hnsw->addPoint(data + i * dim, i);
    }

    // Query the elements for themselves and measure recall
    float correct = 0;
    for (int i = 0; i < max_elements; i++) {
        std::priority_queue<std::pair<float, hnswlib::labeltype>> result = alg_hnsw->searchKnn(data + i * dim, 1);
        hnswlib::labeltype label = result.top().second;
        if (label == i) correct++;
    }
    float recall = correct / max_elements;
    std::cout << "Recall: " << recall << "\n";

    // Serialize index
    std::string hnsw_path = "hnsw.bin";
    alg_hnsw->saveIndex(hnsw_path);
    delete alg_hnsw;

    // Deserialize index and check recall
    alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, hnsw_path);
    correct = 0;
    for (int i = 0; i < max_elements; i++) {
        std::priority_queue<std::pair<float, hnswlib::labeltype>> result = alg_hnsw->searchKnn(data + i * dim, 1);
        hnswlib::labeltype label = result.top().second;
        if (label == i) correct++;
    }
    recall = (float)correct / max_elements;
    std::cout << "Recall of deserialized index: " << recall << "\n";

    delete[] data;
    delete alg_hnsw;
    return 0;
}

Milvus 使用

milvus 通过cgo调用knowhere,knowhere是一个向量检索的抽象封装,集成了FAISS, HNSW等开源ANN库。

knowhere 是直接将hnswlib代码引入,使用hnswlib的代码在
https://github.com/zilliztech/knowhere/blob/main/src/index/hnsw/hnsw.cc

主要是基于hnswlib的C接口,实现HnswIndexNode

cpp 复制代码
namespace knowhere {
class HnswIndexNode : public IndexNode {
 public:
    HnswIndexNode(const int32_t& /*version*/, const Object& object) : index_(nullptr) {
        search_pool_ = ThreadPool::GetGlobalSearchThreadPool();
    }

    Status
    Train(const DataSet& dataset, const Config& cfg) override {
        auto rows = dataset.GetRows();
        auto dim = dataset.GetDim();
        auto hnsw_cfg = static_cast<const HnswConfig&>(cfg);
        hnswlib::SpaceInterface<float>* space = nullptr;
        if (IsMetricType(hnsw_cfg.metric_type.value(), metric::L2)) {
            space = new (std::nothrow) hnswlib::L2Space(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::IP)) {
            space = new (std::nothrow) hnswlib::InnerProductSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::COSINE)) {
            space = new (std::nothrow) hnswlib::CosineSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::HAMMING)) {
            space = new (std::nothrow) hnswlib::HammingSpace(dim);
        } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::JACCARD)) {
            space = new (std::nothrow) hnswlib::JaccardSpace(dim);
        } else {
            LOG_KNOWHERE_WARNING_ << "metric type not support in hnsw: " << hnsw_cfg.metric_type.value();
            return Status::invalid_metric_type;
        }
        auto index = new (std::nothrow)
            hnswlib::HierarchicalNSW<float>(space, rows, hnsw_cfg.M.value(), hnsw_cfg.efConstruction.value());
        if (index == nullptr) {
            LOG_KNOWHERE_WARNING_ << "memory malloc error.";
            return Status::malloc_error;
        }
        if (this->index_) {
            delete this->index_;
            LOG_KNOWHERE_WARNING_ << "index not empty, deleted old index";
        }
        this->index_ = index;
        return Status::success;
    }

    Status
    Add(const DataSet& dataset, const Config& cfg) override {
        
		// ...
     
        std::atomic<uint64_t> counter{0};
        uint64_t one_tenth_row = rows / 10;
        for (int i = 1; i < rows; ++i) {
            futures.emplace_back(build_pool->push([&, idx = i]() {
                index_->addPoint(((const char*)tensor + index_->data_size_ * idx), idx);
                uint64_t added = counter.fetch_add(1);
                if (added % one_tenth_row == 0) {
                    LOG_KNOWHERE_INFO_ << "HNSW build progress: " << (added / one_tenth_row) << "0%";
                }
            }));
        }
        // ...
    }

其他实现

相关推荐
Elastic开源社区17 天前
Elasticsearch 向量检索详解
大数据·elasticsearch·搜索引擎·向量检索
大龄码农有梦想2 个月前
Springboot集成Milvus和Embedding服务,实现向量化检索
spring boot·embedding·milvus·向量检索·spring ai
DashVector3 个月前
如何通过HTTP API更新Doc
数据库·数据仓库·人工智能·http·向量检索
DashVector3 个月前
如何通过HTTP API插入Doc
数据库·人工智能·http·阿里云·向量检索
DashVector3 个月前
如何通过HTTP API插入或更新Doc
大数据·数据库·数据仓库·人工智能·http·数据库架构·向量检索
DashVector3 个月前
如何通过HTTP API检索Doc
数据库·人工智能·http·阿里云·数据库开发·向量检索
阿里云大数据AI技术3 个月前
通过阿里云 Milvus 与 PAI 搭建高效的检索增强对话系统
大数据·阿里云·云计算·milvus·向量检索
DashVector3 个月前
如何通过HTTP API新建Collection
数据库·人工智能·http·阿里云·向量检索
阿里技术4 个月前
HNSW 分布式构建实践
分布式·算法·方案·hnsw·向量检索
吃饭睡觉不准打豆豆4 个月前
从Full-Text Search全文检索到RAG检索增强
大模型·全文检索·智能问答·向量检索·rag