如何基于DSL脚本进行elasticsearch向量检索示例

向量检索是目前RAG库主要依赖功能,这里示例如何在elasticsearch中构建dense_vector向量库,并示例基于DSL脚本的向量相似度检索过程

1 es向量检索

1.1 DSL检索

es访问dense_vector的DSL方法有cosinessimilarity, dotProduct, 1norm或l2norm函数。

需要注意每个DSL脚本只能调用这些函数一次。不要在循环中使用这些函数来计算文档向量和多个其他向量之间的相似性。

如果需要该功能,可以通过直接访问向量值来重新实现这些函数。

1.2 建立连接

假设elasticsearch环境已经构建

以下时连接elasticsearch的代码示例

复制代码
from elasticsearch import Elasticsearch
es = Elasticsearch(hosts=["http://localhost:9200"])
es.info()

输出如下所示

ObjectApiResponse({'name': '87eb7cfd07bb', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'HE1OJj1aQYmnnRL2nwVNgw', 'version': {'number': '8.11.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '64cf052f3b56b1fd4449f5454cb88aca7e739d9a', 'build_date': '2023-12-08T11:33:53.634979452Z', 'build_snapshot': False, 'lucene_version': '9.8.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

2 虚拟简单示例

这里通过虚构简单的三维向量,示例向量入库和检索过程。

2.1 建立索引

构建文本字段和向量字段,采用虚拟三维向量。

复制代码
index_name = "vec_index_0"
es.indices.create(
        index=index_name,
        mappings = {
            "properties": {
              "my_vector": {
                "type": "dense_vector",
                "dims": 3
              },
              "my_text" : {
                "type" : "keyword"
              }
            }
          }
    )

输出如下

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_0'})

2.2 插入数据

数据插入代码如下所示,并虚构3个三维向量。

复制代码
from elasticsearch import Elasticsearch, helpers

texts = [
    "Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.",
    "Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.",
    "Rocky Mountain National Park  is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = [
    [0.5, 10, 6],
    [-0.5, 10, 10],
    [-1, 11, 3]
]
print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)

输出如下

3 3

(3, [])

2.3 向量检索

cosinessimilarity函数计算给定查询向量和文档向量之间的余弦相似性度量。

这里通过DSL脚本应用cosine相似度计算,并加1方式相似度为负。

检索代码如下所示。

复制代码
vec = [-0.5, 10, 6]
print(len(vec))

query = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0",
        "params": {
          "queryVector": [-0.5, 10, 6]
        }
      }
    }
  }
}

res = es.search(index=index_name, body=query)
print(res)

输出如下

{'took': 1, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.9963301, 'hits': [{'_index': 'vec_query_test_v4', '_id': 'erQPgpoBzVHvZ0eub0Bb', '_score': 1.9963301, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.5, 10, 6]}}, {'_index': 'vec_query_test_v4', '_id': 'e7QPgpoBzVHvZ0eub0Bb', '_score': 1.9701604, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [-0.5, 10, 10]}}, {'_index': 'vec_query_test_v4', '_id': 'fLQPgpoBzVHvZ0eub0Bb', '_score': 1.9618319, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [-1, 11, 3]}}]}}

3 实际向量示例

3.1 创建向量引擎

这里基于langchain的OllamaEmbeddings创建向量计算引起,并假设ollama和langchain已安装。

创建代码示例如下

复制代码
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="bge-m3")

3.2 建立索引

构建文本字段和向量字段,采用bge-m3的1024维向量。

复制代码
index_name = "vec_index_ollama"
es.indices.create(
        index=index_name,
        mappings = {
            "properties": {
              "my_vector": {
                "type": "dense_vector",
                "dims": 1024
              },
              "my_text" : {
                "type" : "keyword"
              }
            }
          }
    )

输出如下

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'vec_index_ollama'})

3.3 插入数据

数据插入代码如下所示,并虚构3个bge-m3向量。

复制代码
from elasticsearch import Elasticsearch, helpers

texts = [
    "Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.",
    "Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.",
    "Rocky Mountain National Park  is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.",
]
vectors = embeddings.embed_documents(texts)

print(len(texts), len(vectors))
docs = [ {"my_text": text, "my_vector": vec} for text, vec in zip(texts, vectors)]
bulk_response = helpers.bulk(es, docs, index=index_name)
print(bulk_response)

输出如下

3 3

(3, [])

3.4 向量查询

检索代码如下所示,在计算相似度之前,这里需要先用ollama embedding计算查询向量,。

程序代码示例如下。

复制代码
vec = embeddings.embed_query("Rocky Mountain National Park")
print(len(vec))

query = {
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, 'my_vector')+1.0",
        "params": {
          "queryVector": vec
        }
      }
    }
  }
}

res = es.search(index=index_name, body=query)
print(res)

输出如下

1024

{'took': 9, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3, 'relation': 'eq'}, 'max_score': 1.5386453, 'hits': [{'_index': 'vec_index_ollama', '_id': 'grQbgpoBzVHvZ0eubkCf', '_score': 1.5386453, '_source': {'my_text': 'Rocky Mountain National Park is one of the most popular national parks in the United States. It receives over 4.5 million visitors annually, and is known for its mountainous terrain, including Longs Peak, which is the highest peak in the park. The park is home to a variety of wildlife, including elk, mule deer, moose, and bighorn sheep. The park is also home to a variety of ecosystems, including montane, subalpine, and alpine tundra. The park is a popular destination for hiking, camping, and wildlife viewing, and is a UNESCO World Heritage Site.', 'my_vector': [0.030506486, -0.008855448, -0.0542615, 0.032204177, -0.03877517, -0.061205674, ..., 0.012325888, 0.018495426, 0.018682847]}}, {'_index': 'vec_index_ollama', '_id': 'gbQbgpoBzVHvZ0eubkCf', '_score': 1.4467602, '_source': {'my_text': 'Yosemite National Park is a United States National Park, covering over 750,000 acres of land in California. A UNESCO World Heritage Site, the park is best known for its granite cliffs, waterfalls and giant sequoia trees. Yosemite hosts over four million visitors in most years, with a peak of five million visitors in 2016. The park is home to a diverse range of wildlife, including mule deer, black bears, and the endangered Sierra Nevada bighorn sheep. The park has 1,200 square miles of wilderness, and is a popular destination for rock climbers, with over 3,000 feet of vertical granite to climb. Its most famous and cliff is the El Capitan, a 3,000 feet monolith along its tallest face.', 'my_vector': [0.019682772, 0.0011470582, -0.055675734, 0.019353507, 0.0018442876, -0.04981726, -0.0319025, 0.003043613, 0.0074340384, 0.004517093, 0.0070879683, -0.0009796194, ..., 0.036385845, -0.03322887, 0.032527614, 0.047814477]}}, {'_index': 'vec_index_ollama', '_id': 'gLQbgpoBzVHvZ0eubkCf', '_score': 1.4366195, '_source': {'my_text': 'Yellowstone National Park is one of the largest national parks in the United States. It ranges from the Wyoming to Montana and Idaho, and contains an area of 2,219,791 acress across three different states. Its most famous for hosting the geyser Old Faithful and is centered on the Yellowstone Caldera, the largest super volcano on the American continent. Yellowstone is host to hundreds of species of animal, many of which are endangered or threatened. Most notably, it contains free-ranging herds of bison and elk, alongside bears, cougars and wolves. The national park receives over 4.5 million visitors annually and is a UNESCO World Heritage Site.', 'my_vector': [0.04341205, 0.010377382, -0.053659115, 0.034629174, -0.040680032, -0.05273915, -0.021637093, -0.014424741, 0.003394217, 0.005586104, 0.010616635, 0.039135892, -0.021042265, -0.0010980334, ..., -0.0015615255, 0.02610852, 0.015156581, 0.010001613, 0.034023464]}}]}}

reference


向量数据库:使用Elasticsearch实现向量数据存储与搜索

https://zhuanlan.zhihu.com/p/634015951

相关推荐
aigcapi21 小时前
RAG 系统的黑盒测试:从算法对齐视角解析 GEO 优化的技术指标体系
大数据·人工智能·算法
cui17875681 天前
排队免单模式深度拆解:闭环逻辑、裂变内核与落地法则
大数据
热爱专研AI的学妹1 天前
数眼搜索API与博查技术特性深度对比:实时性与数据完整性的核心差异
大数据·开发语言·数据库·人工智能·python
方向研究1 天前
管仲治国
大数据
成长之路5141 天前
【实证分析】数据资产信息披露程度数据集-含原始数据及do代码(2007-2024年)
大数据
Elastic 中国社区官方博客1 天前
Elasticsearch:在 X-mas 吃一些更健康的东西
android·大数据·数据库·人工智能·elasticsearch·搜索引擎·全文检索
消失的旧时光-19431 天前
微服务的本质,其实是操作系统设计思想
java·大数据·微服务
PNP Robotics1 天前
PNP机器人受邀参加英业达具身智能活动
大数据·人工智能·python·学习·机器人
360智汇云1 天前
存储压缩:不是“挤水分”,而是让数据“轻装上阵
大数据·人工智能
码农小白猿1 天前
IACheck优化电梯定期检验报告:自动化术语审核提升合规性与效率
大数据·运维·人工智能·ai·自动化·iacheck